Foo
Yes
A call to |libxml-parse-html-region| returns this DOM (document
object model):
(html nil
 (head nil)
 (body ((width . "101"))
  (div ((class . "thing"))
   "Foo"
   (div nil
    "Yes"))))
Function: *shr-insert-document* /dom/
This function renders the parsed HTML in dom into the current
buffer. The argument dom should be a list as generated by
|libxml-parse-html-region|. This function is used, for example, by
EWW (see EWW in The Emacs Web Wowser Manual).
Function: *libxml-parse-xml-region* /start end &optional base-url
discard-comments/
This function is the same as |libxml-parse-html-region|, except that
it parses the text as XML rather than HTML (so it is stricter about
syntax).
• Document Object Model <#Document-Object-Model> Access, manipulate
and search the DOM.
32.28.1 Document Object Model
The DOM returned by |libxml-parse-html-region| (and the other XML
parsing functions) is a tree structure where each node has a node name
(called a /tag/), an optional key/value /attribute/ list, and then a
list of /child nodes/. The child nodes are either strings or DOM objects.
(body ((width . "101"))
 (div ((class . "thing"))
  "Foo"
  (div nil
   "Yes")))
Function: *dom-node* /tag &optional attributes &rest children/
This function creates a DOM node of type tag. If given, attributes
should be a key/value pair list. If given, children should be DOM
nodes.
The following functions can be used to work with this structure. Each
function takes a DOM node, or a list of nodes. In the latter case, only
the first node in the list is used.
Simple accessors:
|dom-tag node|
Return the /tag/ (also called “node name”) of the node.
|dom-attr node attribute|
Return the value of attribute in the node. A common usage would be:
(dom-attr img 'src)
=> "https://fsf.org/logo.png"
|dom-children node|
Return all the children of the node.
|dom-non-text-children node|
Return all the non-string children of the node.
|dom-attributes node|
Return the key/value pair list of attributes of the node.
|dom-text node|
Return all the textual elements of the node as a concatenated string.
|dom-texts node|
Return all the textual elements of the node, as well as the textual
elements of all the children of the node, recursively, as a
concatenated string. This function also takes an optional separator
to be inserted between the textual elements.
|dom-parent dom node|
Return the parent of node in dom.
|dom-remove dom node|
Remove node from dom.
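As a minimal sketch of the accessors above, the following builds the example tree by hand with |dom-node| instead of parsing HTML, then reads values back out of it. The names |my-dom-example| and |my-dom-summary| are hypothetical and exist only for this illustration.

```elisp
(require 'dom)

(defun my-dom-example ()
  "Return a fresh copy of the example DOM shown earlier."
  (dom-node 'body '((width . "101"))
            (dom-node 'div '((class . "thing"))
                      "Foo"
                      (dom-node 'div nil "Yes"))))

(defun my-dom-summary ()
  "Return a list of values read from the example DOM."
  (let* ((body (my-dom-example))
         (div (car (dom-non-text-children body))))
    (list (dom-tag body)                     ; body
          (dom-attr body 'width)             ; "101"
          (dom-attributes div)               ; ((class . "thing"))
          (dom-text div)                     ; "Foo"
          (dom-texts div "|")                ; "Foo|Yes"
          (eq (dom-parent body div) body)))) ; t
```

Note that |dom-text| only sees the direct string children of the outer div, while |dom-texts| also descends into the nested div.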
The following are functions for altering the DOM.
|dom-set-attribute node attribute value|
Set the attribute of the node to value.
|dom-append-child node child|
Append child as the last child of node.
|dom-add-child-before node child before|
Add child to node’s child list before the before node. If before is
|nil|, make child the first child.
|dom-set-attributes node attributes|
Replace all the attributes of the node with a new key/value list.
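The functions above modify a node in place. A small sketch, assuming a hypothetical helper named |my-dom-build-list|, shows how a list element might be assembled step by step:

```elisp
(require 'dom)

(defun my-dom-build-list ()
  "Return a small DOM built with the functions for altering nodes."
  (let ((ul (dom-node 'ul))
        (item (dom-node 'li nil "second")))
    ;; ITEM becomes the only (and therefore last) child of UL.
    (dom-append-child ul item)
    ;; Insert a new node in front of ITEM.
    (dom-add-child-before ul (dom-node 'li nil "first") item)
    ;; UL has no attributes yet, so this adds one.
    (dom-set-attribute ul 'class "menu")
    ul))

(my-dom-build-list)
;; => (ul ((class . "menu")) (li nil "first") (li nil "second"))
```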
The following are functions for searching for elements in the DOM. They
all return lists of matching nodes.
|dom-by-tag dom tag|
Return all nodes in dom that are of type tag. A typical use would be:
(dom-by-tag dom 'td)
=> '((td ...) (td ...) (td ...))
|dom-by-class dom match|
Return all nodes in dom that have class names that match match,
which is a regular expression.
|dom-by-style dom match|
Return all nodes in dom that have styles that match match, which is
a regular expression.
|dom-by-id dom match|
Return all nodes in dom that have IDs that match match, which is a
regular expression.
|dom-search dom predicate|
Return all nodes in dom where predicate returns a non-|nil| value.
predicate is called with the node to be tested as its parameter.
|dom-strings dom|
Return all strings in dom.
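The search functions can be tried on a hand-built tree. This is only a sketch: |my-dom-search-demo| is a hypothetical name, and the |id| attribute is added to the example purely so that |dom-by-id| has something to find.

```elisp
(require 'dom)

(defun my-dom-search-demo ()
  "Return the results of several searches over a small DOM."
  (let ((dom (dom-node 'body nil
                       (dom-node 'div '((class . "thing") (id . "outer"))
                                 "Foo"
                                 (dom-node 'div '((class . "other"))
                                           "Yes")))))
    (list
     ;; Both div elements match.
     (length (dom-by-tag dom 'div))
     ;; The regexp is anchored, so only the class "thing" matches.
     (mapcar #'dom-text (dom-by-class dom "\\`thing\\'"))
     (mapcar #'dom-tag (dom-by-id dom "outer"))
     (dom-strings dom)
     ;; An arbitrary predicate: nodes that have a class attribute.
     (length (dom-search dom (lambda (node) (dom-attr node 'class)))))))

(my-dom-search-demo)
;; => (2 ("Foo") (div) ("Foo" "Yes") 2)
```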
Utility functions:
|dom-pp dom &optional remove-empty|
Pretty-print dom at point. If remove-empty is non-|nil|, don’t print
textual nodes that just contain white-space.
32.29 Parsing and generating JSON values
When Emacs is compiled with JSON (/JavaScript Object Notation/) support,
it provides several functions to convert between Lisp objects and JSON
values. Any JSON value can be converted to a Lisp object, but not vice
versa. Specifically:
* JSON uses three keywords: |true|, |null|, |false|. |true| is
represented by the symbol |t|. By default, the remaining two are
represented, respectively, by the symbols |:null| and |:false|.
* JSON only has floating-point numbers. They can represent both Lisp
integers and Lisp floating-point numbers.
* JSON strings are always Unicode strings encoded in UTF-8. Lisp
strings can contain non-Unicode characters.
* JSON has only one sequence type, the array. JSON arrays are
represented using Lisp vectors.
* JSON has only one map type, the object. JSON objects are represented
using Lisp hashtables, alists or plists. When an alist or plist
contains several elements with the same key, Emacs uses only the
first element for serialization, in accordance with the behavior of
|assq|.
Note that |nil|, being both a valid alist and a valid plist, represents
|{}|, the empty JSON object; not |null|, |false|, or an empty array, all
of which are different JSON values.
If some Lisp object can’t be represented in JSON, the serialization
functions will signal an error of type |wrong-type-argument|. The
parsing functions can also signal the following errors:
|json-end-of-file|
Signaled when encountering a premature end of the input text.
|json-trailing-content|
Signaled when encountering unexpected input after the first JSON
object parsed.
|json-parse-error|
Signaled when encountering invalid JSON syntax.
Only top-level values (arrays and objects) can be serialized to JSON.
The subobjects within these top-level values can be of any type.
Likewise, the parsing functions will only return vectors, hashtables,
alists, and plists.
Function: *json-serialize* /object &rest args/
This function returns a new Lisp string which contains the JSON
representation of object. The argument args is a list of
keyword/argument pairs. The following keywords are accepted:
|:null-object|
The value decides which Lisp object to use to represent the JSON
keyword |null|. It defaults to the symbol |:null|.
|:false-object|
The value decides which Lisp object to use to represent the JSON
keyword |false|. It defaults to the symbol |:false|.
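The mapping rules above can be checked with a few calls. This sketch assumes an Emacs with JSON support; |my-json-encode-examples| and the symbols |none| and |no| are hypothetical names picked for the example.

```elisp
(defun my-json-encode-examples ()
  "Return a list of JSON strings produced by `json-serialize'."
  (list
   ;; A vector becomes an array; t becomes true.
   (json-serialize [1 "x" t])
   ;; An alist with symbol keys becomes an object.
   (json-serialize '((name . "Emacs")))
   ;; nil is the empty object, not null.
   (json-serialize nil)
   ;; The default Lisp objects for null and false.
   (json-serialize [:null :false])
   ;; Caller-chosen objects for null and false.
   (json-serialize [none no] :null-object 'none :false-object 'no)))

(my-json-encode-examples)
;; => ("[1,\"x\",true]" "{\"name\":\"Emacs\"}" "{}"
;;     "[null,false]" "[null,false]")
```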
Function: *json-insert* /object &rest args/
This function inserts the JSON representation of object into the
current buffer before point. The argument args is interpreted as in
|json-serialize|.
Function: *json-parse-string* /string &rest args/
This function parses the JSON value in string, which must be a Lisp
string. If string doesn’t contain a valid JSON object, this function
signals the |json-parse-error| error.
The argument args is a list of keyword/argument pairs. The following
keywords are accepted:
|:object-type|
The value decides which Lisp object to use for representing the
key-value mappings of a JSON object. It can be either
|hash-table|, the default, to make hashtables with strings as
keys; |alist| to use alists with symbols as keys; or |plist| to
use plists with keyword symbols as keys.
|:array-type|
The value decides which Lisp object to use for representing a
JSON array. It can be either |array|, the default, to use Lisp
arrays; or |list| to use lists.
|:null-object|
The value decides which Lisp object to use to represent the JSON
keyword |null|. It defaults to the symbol |:null|.
|:false-object|
The value decides which Lisp object to use to represent the JSON
keyword |false|. It defaults to the symbol |:false|.
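As a sketch of how the keywords combine, the hypothetical helper |my-json-decode| below asks for alists and lists, maps both |null| and |false| to |nil|, and turns any parse error into the symbol |invalid|. It assumes an Emacs with JSON support.

```elisp
(defun my-json-decode (text)
  "Parse TEXT, returning alists for objects and lists for arrays.
JSON null and false both become nil.  Return the symbol `invalid'
if TEXT cannot be parsed."
  (condition-case nil
      (json-parse-string text
                         :object-type 'alist
                         :array-type 'list
                         :null-object nil
                         :false-object nil)
    ;; `json-end-of-file' and `json-trailing-content' are
    ;; sub-conditions of `json-parse-error', so this handler
    ;; catches them as well.
    (json-parse-error 'invalid)))

(alist-get 'tags (my-json-decode "{\"name\": \"Emacs\", \"tags\": [\"editor\", null]}"))
;; => ("editor" nil)
```

Mapping |null| and |false| to the same Lisp value loses information, so this choice only suits callers that do not need to tell the two apart.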
Function: *json-parse-buffer* /&rest args/
This function reads the next JSON value from the current buffer,
starting at point. It moves point to the position immediately after
the value if the buffer contains a valid JSON object; otherwise it
signals the |json-parse-error| error and doesn’t move point. The
argument args is interpreted as in |json-parse-string|.
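Because |json-parse-buffer| leaves point after the value it read, it can be called in a loop to read several values in a row. The following is a minimal sketch under that assumption; |my-json-read-all| is a hypothetical helper and the sketch assumes an Emacs with JSON support.

```elisp
(defun my-json-read-all (text)
  "Return a list of all JSON values found in TEXT, in order."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let (values)
      ;; Skip whitespace between values, then stop at end of buffer.
      (while (progn (skip-chars-forward " \t\n")
                    (not (eobp)))
        (push (json-parse-buffer :object-type 'plist) values))
      (nreverse values))))

(my-json-read-all "{\"a\": 1}\n[2, 3]")
;; => ((:a 1) [2 3])
```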
32.30 JSONRPC communication
The |jsonrpc| library implements the JSONRPC specification, version 2.0,
as it is described in https://www.jsonrpc.org/. As the name suggests,
JSONRPC is a generic /Remote Procedure Call/ protocol designed around
JSON objects, which you can convert to and from Lisp objects (see
Parsing JSON <#Parsing-JSON>).
• JSONRPC Overview <#JSONRPC-Overview>
• Process-based JSONRPC connections
<#Process_002dbased-JSONRPC-connections>
• JSONRPC JSON object format <#JSONRPC-JSON-object-format>
• JSONRPC deferred requests <#JSONRPC-deferred-requests>
32.30.1 Overview
Quoting from the spec, JSONRPC "is transport
agnostic in that the concepts can be used within the same process, over
sockets, over http, or in many various message passing environments."
To model this agnosticism, the |jsonrpc| library uses objects of a
|jsonrpc-connection| class, which represent a connection to a remote
JSON endpoint (for details on Emacs’s object system, see EIEIO in
EIEIO). In modern object-oriented parlance, this class is “abstract”,
i.e. the actual class of a useful connection object is always a subclass
of |jsonrpc-connection|. Nevertheless, we can define two distinct APIs
around the |jsonrpc-connection| class:
1. A user interface for building JSONRPC applications
In this scenario, the JSONRPC application selects a concrete
subclass of |jsonrpc-connection|, and proceeds to create objects of
that subclass using |make-instance|. To initiate a contact to the
remote endpoint, the JSONRPC application passes this object to the
functions |jsonrpc-notify|, |jsonrpc-request|, and/or
|jsonrpc-async-request|. For handling remotely initiated contacts,
which generally come in asynchronously, the instantiation should
include |:request-dispatcher| and |:notification-dispatcher|
initargs, which are both functions of 3 arguments: the connection
object; a symbol naming the JSONRPC method invoked remotely; and a
JSONRPC |params| object.
The function passed as |:request-dispatcher| is responsible for
handling the remote endpoint’s requests, which expect a reply from
the local endpoint (in this case, the program you’re building).
Inside that function, you may either return locally (a normal
return) or non-locally (an error return). A local return value must
be a Lisp object that can be serialized as JSON (see Parsing JSON
<#Parsing-JSON>). This determines a success response, and the object
is forwarded to the server as the JSONRPC |result| object. A
non-local return, achieved by calling the function |jsonrpc-error|,
causes an error response to be sent to the server. The details of
the accompanying JSONRPC |error| are filled out with whatever was
passed to |jsonrpc-error|. A non-local return triggered by an
unexpected error of any other type also causes an error response to
be sent (unless you have set |debug-on-error|, in which case this
calls the Lisp debugger, see Error Debugging <#Error-Debugging>).
2. An inheritance interface for building JSONRPC transport implementations
In this scenario, |jsonrpc-connection| is subclassed to implement a
different underlying transport strategy (for details on how to
subclass, see (eieio)Inheritance).
Users of the application-building interface can then instantiate
objects of this concrete class (using the |make-instance| function)
and connect to JSONRPC endpoints using that strategy.
This API has mandatory and optional parts.
To allow its users to initiate JSONRPC contacts (notifications or
requests) or reply to endpoint requests, the subclass must have an
implementation of the |jsonrpc-connection-send| method.
Likewise, for handling the three types of remote contacts (requests,
notifications, and responses to local requests), the transport
implementation must arrange for the function
|jsonrpc-connection-receive| to be called after noticing a new
JSONRPC message on the wire (whatever that "wire" may be).
Finally, and optionally, the |jsonrpc-connection| subclass should
implement the |jsonrpc-shutdown| and |jsonrpc-running-p| methods if
these concepts apply to the transport. If they do, then any system
resources (e.g. processes, timers, etc.) used to listen for messages
on the wire should be released in |jsonrpc-shutdown|, i.e. they
should only be needed while |jsonrpc-running-p| is non-nil.
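To illustrate the first interface, here is a minimal sketch of a function suitable as the |:request-dispatcher| initarg. The name |my-rpc-request-dispatcher| and the remote method |add| with its |:a| and |:b| parameters are hypothetical; the error code -32601 is the "Method not found" code defined by the JSONRPC 2.0 specification.

```elisp
(require 'jsonrpc)

(defun my-rpc-request-dispatcher (_connection method params)
  "Handle a remote request for METHOD with PARAMS.
PARAMS is assumed to be a JSON-compatible plist."
  (pcase method
    ;; A normal return: the value is sent back as the `result'.
    ('add (+ (plist-get params :a) (plist-get params :b)))
    ;; A non-local return: an error response is sent instead.
    (_ (jsonrpc-error :code -32601 :message "Method not found"))))

(my-rpc-request-dispatcher nil 'add '(:a 1 :b 2))
;; => 3
```

A connection would receive this function as |:request-dispatcher #'my-rpc-request-dispatcher| when it is created with |make-instance|.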
32.30.2 Process-based JSONRPC connections
For convenience, the |jsonrpc| library comes with a built-in
|jsonrpc-process-connection| transport implementation that can talk to
local subprocesses (using the standard input and standard output); or
TCP hosts (using sockets); or any other remote endpoint that Emacs’s
process object can represent (see Processes <#Processes>).
Using this transport, the JSONRPC messages are encoded on the wire as
plain text and prefaced by some basic HTTP-style enveloping headers,
such as “Content-Length”.
For an example of an application using this transport scheme on top of
JSONRPC, see the Language Server Protocol.
Along with the mandatory |:request-dispatcher| and
|:notification-dispatcher| initargs, users of the
|jsonrpc-process-connection| class should pass the following initargs as
keyword-value pairs to |make-instance|:
|:process|
Value must be a live process object or a function of no arguments
producing one such object. If passed a process object, the object is
expected to contain a pre-established connection; otherwise, the
function is called immediately after the object is made.
|:on-shutdown|
Value must be a function of a single argument, the
|jsonrpc-process-connection| object. The function is called after
the underlying process object has been deleted (either deliberately
by |jsonrpc-shutdown|, or unexpectedly, because of some external
cause).
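The initargs above might be combined as in the following sketch. Everything named |my-rpc-...| is hypothetical, including the external program "my-rpc-server" and its "--stdio" flag. The |:name| initarg and the |make-process| settings are assumptions based on how |jsonrpc.el| is commonly used and are not covered by the text above.

```elisp
(require 'jsonrpc)

(defun my-rpc-connect ()
  "Start the hypothetical program \"my-rpc-server\" and connect to it."
  (make-instance
   'jsonrpc-process-connection
   ;; Assumption: a name identifying the connection.
   :name "my-rpc"
   ;; A function of no arguments that produces the process object.
   :process (lambda ()
              (make-process :name "my-rpc-server"
                            :command '("my-rpc-server" "--stdio")
                            :connection-type 'pipe
                            :coding 'utf-8-emacs-unix
                            :noquery t))
   ;; These two dispatchers ignore everything the remote end sends.
   :request-dispatcher #'ignore
   :notification-dispatcher #'ignore
   ;; Called with the connection after the process is gone.
   :on-shutdown (lambda (_connection)
                  (message "my-rpc: connection shut down"))))
```

The object returned by such a function is what you would pass to |jsonrpc-request|, |jsonrpc-notify|, and |jsonrpc-shutdown|.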
32.30.3 JSONRPC JSON object format
JSONRPC JSON objects are exchanged as Lisp plists (see Property Lists
<#Property-Lists>): JSON-compatible plists are handed to the dispatcher
functions and, likewise, JSON-compatible plists should be given to
|jsonrpc-notify|, |jsonrpc-request|, and |jsonrpc-async-request|.
To facilitate handling plists, this library makes liberal use of the
|cl-lib| library (see cl-lib in Common Lisp Extensions for GNU Emacs
Lisp) and suggests (but doesn’t force) its clients to do the same. A
macro |jsonrpc-lambda| can be used to create a lambda for destructuring
a JSON object, as in this example:
(jsonrpc-async-request
 myproc :frobnicate `(:foo "trix")
 :success-fn (jsonrpc-lambda (&key bar baz &allow-other-keys)
               (message "Server replied back with %s and %s!"
                        bar baz))
 :error-fn (jsonrpc-lambda (&key code message _data)
             (message "Sadly, server reports %s: %s"
                      code message)))
32.30.4 Deferred JSONRPC requests
In many RPC situations, synchronization between the two communicating
endpoints is a matter of correctly designing the RPC application: when
synchronization is needed, requests (which are blocking) should be used;
when it isn’t, notifications should suffice. However, when Emacs acts as
one of these endpoints, asynchronous events (e.g. timer- or
process-related) may be triggered while there is still uncertainty about
the state of the remote endpoint. Furthermore, acting on these events
may only sometimes demand synchronization, depending on the event’s
specific nature.
The |:deferred| keyword argument to |jsonrpc-request| and
|jsonrpc-async-request| is designed to let the caller indicate that the
specific request needs synchronization and its actual issuance may be
delayed to the future, until some condition is satisfied. Specifying
|:deferred| for a request doesn’t mean it /will/ be delayed, only that
it /can/ be. If the request isn’t sent immediately, |jsonrpc| will make
renewed efforts to send it at certain key times during communication,
such as when receiving or sending other messages to the endpoint.
Before any attempt to send the request, the application-specific
conditions are checked. Since the |jsonrpc| library can’t know what
these conditions are, the program can use the
|jsonrpc-connection-ready-p| generic function (see Generic Functions
<#Generic-Functions>) to specify them. The default method for this
function returns |t|, but you can add overriding methods that return
|nil| in some situations, based on the arguments passed to it, which are
the |jsonrpc-connection| object (see JSONRPC Overview
<#JSONRPC-Overview>) and whichever value you passed as the |:deferred|
keyword argument.
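A sketch of such an overriding method follows. It assumes a hypothetical subclass |my-rpc-connection| with an |initialized| slot that the application sets once some handshake with the remote endpoint has completed; none of these names come from the |jsonrpc| library.

```elisp
(require 'jsonrpc)

(defclass my-rpc-connection (jsonrpc-connection)
  ((initialized :initform nil
                :accessor my-rpc-initialized
                :documentation "Non-nil once the handshake is done."))
  :documentation "Hypothetical connection that tracks a handshake.")

(cl-defmethod jsonrpc-connection-ready-p ((conn my-rpc-connection) _what)
  "Let deferred requests through only after the handshake."
  ;; _WHAT is the value passed as :deferred; it is unused here.
  (and (cl-call-next-method)
       (my-rpc-initialized conn)))
```

With this method in place, a request issued with a non-|nil| |:deferred| argument would be held back until the application sets the slot, for example with |(setf (my-rpc-initialized conn) t)|.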
32.31 Atomic Change Groups
In database terminology, an /atomic/ change is an indivisible change—it
can succeed entirely or it can fail entirely, but it cannot partly
succeed. A Lisp program can make a series of changes to one or several
buffers as an /atomic change group/, meaning that either the entire
series of changes will be installed in their buffers or, in case of an
error, none of them will be.
To do this for one buffer, the one already current, simply write a call
to |atomic-change-group| around the code that makes the changes, like this:
(atomic-change-group
  (insert foo)
  (delete-region x y))
If an error (or other nonlocal exit) occurs inside the body of
|atomic-change-group|, it unmakes all the changes in that buffer that
were made during the execution of the body. This kind of change group has no
effect on any other buffers—any such changes remain.
If you need something more sophisticated, such as to make changes in
various buffers constitute one atomic group, you must directly call
lower-level functions that |atomic-change-group| uses.
Function: *prepare-change-group* /&optional buffer/
This function sets up a change group for buffer buffer, which
defaults to the current buffer. It returns a handle that represents
the change group. You must use this handle to activate the change
group and subsequently to finish it.
To use the change group, you must /activate/ it. You must do this before
making any changes in the text of buffer.
Function: *activate-change-group* /handle/
This function activates the change group that handle designates.
After you activate the change group, any changes you make in that buffer
become part of it. Once you have made all the desired changes in the
buffer, you must /finish/ the change group. There are two ways to do
this: you can either accept (and finalize) all the changes, or cancel
them all.
Function: *accept-change-group* /handle/
This function accepts all the changes in the change group specified
by handle, making them final.
Function: *cancel-change-group* /handle/
This function cancels and undoes all the changes in the change group
specified by handle.
You can cause some or all of the changes in a change group to be
considered as a single unit for the purposes of the |undo| commands (see
Undo <#Undo>) by using |undo-amalgamate-change-group|.
Function: *undo-amalgamate-change-group* /handle/
Amalgamate all the changes made in the change-group since the state
identified by handle. This function removes all undo boundaries
between undo records of changes since the state described by handle.
Usually, handle is the handle returned by |prepare-change-group|, in
which case all the changes since the beginning of the change-group
are amalgamated into a single undo unit.
Your code should use |unwind-protect| to make sure the group is always
finished. The call to |activate-change-group| should be inside the
|unwind-protect|, in case the user types C-g just after it runs. (This
is one reason why |prepare-change-group| and |activate-change-group| are
separate functions, because normally you would call
|prepare-change-group| before the start of that |unwind-protect|.) Once
you finish the group, don’t use the handle again—in particular, don’t
try to finish the same group twice.
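The pattern described above can be sketched as follows. The function |my-call-atomically| is a hypothetical, simplified stand-in written for illustration; it is not the definition of |atomic-change-group|, which takes additional precautions.

```elisp
(defun my-call-atomically (function)
  "Call FUNCTION with no arguments and return its value.
If FUNCTION exits nonlocally, undo its changes to the current buffer."
  ;; Prepare outside the `unwind-protect', activate inside it.
  (let ((handle (prepare-change-group))
        (success nil))
    (unwind-protect
        (progn
          (activate-change-group handle)
          (prog1 (funcall function)
            (setq success t)))
      ;; Finish the group exactly once, whatever happened.
      (if success
          (accept-change-group handle)
        (cancel-change-group handle)))))
```

For example, calling it with a function that inserts text and then signals an error should leave the buffer text as it was before the call.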
To make a multibuffer change group, call |prepare-change-group| once for
each buffer you want to cover, then use |nconc| to combine the returned
values, like this:
(nconc (prepare-change-group buffer-1)
       (prepare-change-group buffer-2))
You can then activate the multibuffer change group with a single call to
|activate-change-group|, and finish it with a single call to
|accept-change-group| or |cancel-change-group|.
Nested use of several change groups for the same buffer works as you
would expect. Non-nested use of change groups for the same buffer will
get Emacs confused, so don’t let it happen; the first change group you
start for any given buffer should be the last one finished.
32.32 Change Hooks
These hook variables let you arrange to take notice of changes in
buffers (or in a particular buffer, if you make them buffer-local). See
also Special Properties <#Special-Properties>, for how to detect changes
to specific parts of the text.
The functions you use in these hooks should save and restore the match
data if they do anything that uses regular expressions; otherwise, they
will interfere in bizarre ways with the editing operations that call them.
Variable: *before-change-functions*
This variable holds a list of functions to call when Emacs is about
to modify a buffer. Each function gets two arguments, the beginning
and end of the region that is about to change, represented as
integers. The buffer that is about to change is always the current
buffer when the function is called.
Variable: *after-change-functions*
This variable holds a list of functions to call after Emacs modifies
a buffer. Each function receives three arguments: the beginning and
end of the region just changed, and the length of the text that
existed before the change. All three arguments are integers. The
buffer that has been changed is always the current buffer when the
function is called.
The length of the old text is the difference between the buffer
positions before and after that text as it was before the change. As
for the changed text, its length is simply the difference between
the first two arguments.
Output of messages into the *Messages* buffer does not call these
functions, and neither do certain internal buffer changes, such as
changes in buffers created by Emacs internally for certain jobs, that
should not be visible to Lisp programs.
The vast majority of buffer changing primitives will call
|before-change-functions| and |after-change-functions| in balanced
pairs, once for each change, where the arguments to these hooks exactly
delimit the change being made. Yet, hook functions should not rely on
this always being the case, because some complex primitives call
|before-change-functions| once before making changes, and then call
|after-change-functions| zero or more times, depending on how many
individual changes the primitive is making. When that happens, the
arguments to |before-change-functions| will enclose a region in which
the individual changes are made, but won’t necessarily be the minimal
such region, and the arguments to each successive call of
|after-change-functions| will then delimit the part of text being
changed exactly. In general, we advise using either the before- or the
after-change hook, but not both.
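As a minimal sketch of an after-change function, the following records the three arguments of every call in a buffer-local list. The names |my-change-log|, |my-record-change|, and |my-track-changes| are hypothetical. The function does not use regular expressions, so it has no match data to save.

```elisp
(defvar-local my-change-log nil
  "List of (BEG END OLD-LENGTH) entries for this buffer, newest first.")

(defun my-record-change (beg end old-length)
  "Record one change; meant for `after-change-functions'."
  (push (list beg end old-length) my-change-log))

(defun my-track-changes ()
  "Start recording changes in the current buffer only."
  ;; The final t makes the hook buffer-local.
  (add-hook 'after-change-functions #'my-record-change nil t))
```

After |(my-track-changes)|, inserting "hello" at position 1 records |(1 6 0)|, and then deleting the first two characters records |(1 1 2)|.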
Macro: *combine-after-change-calls* /body…/
The macro executes body normally, but arranges to call the
after-change functions just once for a series of several changes—if
that seems safe.
If a program makes several text changes in the same area of the
buffer, using the macro |combine-after-change-calls| around that
part of the program can make it run considerably faster when
after-change hooks are in use. When the after-change hooks are
ultimately called, the arguments specify a portion of the buffer
including all of the changes made within the
|combine-after-change-calls| body.
*Warning:* You must not alter the values of |after-change-functions|
within the body of a |combine-after-change-calls| form.
*Warning:* If the changes you combine occur in widely scattered
parts of the buffer, this will still work, but it is not advisable,
because it may lead to inefficient behavior for some change hook
functions.
Macro: *combine-change-calls* /beg end body…/
This executes body normally, except any buffer changes it makes do
not trigger the calls to |before-change-functions| and
|after-change-functions|. Instead there is a single call of each of
these hooks for the region enclosed by beg and end, the parameters
supplied to |after-change-functions| reflecting the changes made to
the size of the region by body.
The result of this macro is the result returned by body.
This macro is useful when a function makes a possibly large number
of repetitive changes to the buffer, and the change hooks would
otherwise take a long time to run, were they to be run for each
individual buffer modification. Emacs itself uses this macro, for
example, in the commands |comment-region| and |uncomment-region|.
*Warning:* You must not alter the values of
|before-change-functions| or |after-change-functions| within body.
*Warning:* You must not make any buffer changes outside of the
region specified by beg and end.
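The following sketch shows one possible use of |combine-change-calls|. The helper |my-prefix-lines| is hypothetical; it assumes that beg is at the beginning of a line, so that every insertion falls inside the region given to the macro.

```elisp
(defun my-prefix-lines (beg end prefix)
  "Insert PREFIX before each line that starts between BEG and END.
BEG is assumed to be at the beginning of a line."
  (let ((end-marker (copy-marker end)))
    (combine-change-calls beg end
      (save-excursion
        (goto-char beg)
        ;; END-MARKER moves along as text is inserted before it.
        (while (< (point) end-marker)
          (insert prefix)
          (forward-line 1))))
    (set-marker end-marker nil)))
```

However many lines are changed, the change hooks should see one change that covers the whole region rather than one call per inserted prefix.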
Variable: *first-change-hook*
This variable is a normal hook that is run whenever a buffer is
changed that was previously in the unmodified state.
Variable: *inhibit-modification-hooks*
If this variable is non-|nil|, all of the change hooks are disabled;
none of them run. This affects all the hook variables described
above in this section, as well as the hooks attached to certain
special text properties (see Special Properties
<#Special-Properties>) and overlay properties (see Overlay
Properties <#Overlay-Properties>).
Also, this variable is bound to non-|nil| while running those same
hook variables, so that by default modifying the buffer from a
modification hook does not cause other modification hooks to be run.
If you do want modification hooks to be run in a particular piece of
code that is itself run from a modification hook, then rebind
locally |inhibit-modification-hooks| to |nil|. However, doing this
may cause recursive calls to the modification hooks, so be sure to
prepare for that (for example, by binding some variable which tells
your hook to do nothing).
We recommend that you only bind this variable for modifications that
do not result in lasting changes to buffer text contents (for
example face changes or temporary modifications). If you need to
delay change hooks during a series of changes (typically for
performance reasons), use |combine-change-calls| or
|combine-after-change-calls| instead.
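A sketch of the recommended kind of use, a face change that leaves the buffer text alone, might look like this. The function name |my-highlight-quietly| is hypothetical.

```elisp
(defun my-highlight-quietly (beg end)
  "Highlight the text between BEG and END without running change hooks."
  ;; Changing a text property counts as a buffer modification,
  ;; so the change hooks would normally run here.
  (let ((inhibit-modification-hooks t))
    (put-text-property beg end 'face 'highlight)))
```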
33 Non-ASCII Characters
This chapter covers the special issues relating to characters and how
they are stored in strings and buffers.
• Text Representations <#Text-Representations> How Emacs represents
text.
• Disabling Multibyte <#Disabling-Multibyte> Controlling whether to
use multibyte characters.
• Converting Representations <#Converting-Representations> Converting
unibyte to multibyte and vice versa.
• Selecting a Representation <#Selecting-a-Representation> Treating a
byte sequence as unibyte or multi.
• Character Codes <#Character-Codes> How unibyte and multibyte relate
to codes of individual characters.
• Character Properties <#Character-Properties> Character attributes
that define their behavior and handling.
• Character Sets <#Character-Sets> The space of possible character
codes is divided into various character sets.
• Scanning Charsets <#Scanning-Charsets> Which character sets are
used in a buffer?
• Translation of Characters <#Translation-of-Characters> Translation
tables are used for conversion.
• Coding Systems <#Coding-Systems> Coding systems are conversions for
saving files.
• Input Methods <#Input-Methods> Input methods allow users to enter
various non-ASCII characters without special keyboards.
• Locales <#Locales> Interacting with the POSIX locale.
33.1 Text Representations
Emacs buffers and strings support a large repertoire of characters from
many different scripts, allowing users to type and display text in
almost any known written language.
To support this multitude of characters and scripts, Emacs closely
follows the /Unicode Standard/. The Unicode Standard assigns a unique
number, called a /codepoint/, to each and every character. The range of
codepoints defined by Unicode, or the Unicode /codespace/, is
|0..#x10FFFF| (in hexadecimal notation), inclusive. Emacs extends this
range with codepoints in the range |#x110000..#x3FFFFF|, which it uses
for representing characters that are not unified with Unicode and /raw
8-bit bytes/ that cannot be interpreted as characters. Thus, a character
codepoint in Emacs is a 22-bit integer.
To conserve memory, Emacs does not hold fixed-length 22-bit numbers that
are codepoints of text characters within buffers and strings. Rather,
Emacs uses a variable-length internal representation of characters, that
stores each character as a sequence of 1 to 5 8-bit bytes, depending on
the magnitude of its codepoint^17 <#FOOT17>. For example, any ASCII
character takes up only 1 byte, a Latin-1 character takes up 2 bytes,
etc. We call this representation of text /multibyte/.
Outside Emacs, characters can be represented in many different
encodings, such as ISO-8859-1, GB-2312, Big-5, etc. Emacs converts
between these external encodings and its internal representation, as
appropriate, when it reads text into a buffer or a string, or when it
writes text to a disk file or passes it to some other process.
Occasionally, Emacs needs to hold and manipulate encoded text or binary
non-text data in its buffers or strings. For example, when Emacs visits
a file, it first reads the file’s text verbatim into a buffer, and only
then converts it to the internal representation. Before the conversion,
the buffer holds encoded text.
Encoded text is not really text, as far as Emacs is concerned, but
rather a sequence of raw 8-bit bytes. We call buffers and strings that
hold encoded text /unibyte/ buffers and strings, because Emacs treats
them as a sequence of individual bytes. Usually, Emacs displays unibyte
buffers and strings as octal codes such as |\237|. We recommend that you
never use unibyte buffers and strings except for manipulating encoded
text or binary non-text data.
In a buffer, the buffer-local value of the variable
|enable-multibyte-characters| specifies the representation used. The
representation for a string is determined and recorded in the string
when the string is constructed.
Variable: *enable-multibyte-characters*
This variable specifies the current buffer’s text representation. If
it is non-|nil|, the buffer contains multibyte text; otherwise, it
contains unibyte encoded text or binary non-text data.
You cannot set this variable directly; instead, use the function
|set-buffer-multibyte| to change a buffer’s representation.
Function: *position-bytes* /position/
Buffer positions are measured in character units. This function
returns the byte-position corresponding to buffer position position
in the current buffer. This is 1 at the start of the buffer, and
counts upward in bytes. If position is out of range, the value is
|nil|.
Function: *byte-to-position* /byte-position/
Return the buffer position, in character units, corresponding to
given byte-position in the current buffer. If byte-position is out
of range, the value is |nil|. In a multibyte buffer, an arbitrary
value of byte-position may fall not on a character boundary but inside
a multibyte sequence representing a single character; in this case,
this function returns the buffer position of the character whose
multibyte sequence includes byte-position. In other words, the value
does not change for all byte positions that belong to the same
character.
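As a minimal sketch of how character positions and byte positions
relate, the following hypothetical helper (the name
|my-byte-position-demo| is an assumption made for illustration)
inserts three characters into a temporary multibyte buffer, the
middle one of which occupies two bytes in the internal representation:

```elisp
(defun my-byte-position-demo ()
  "Return character/byte position pairs for a small multibyte buffer."
  (with-temp-buffer
    (set-buffer-multibyte t)
    ;; "a", then U+00E9 (two bytes internally), then "b".
    (insert "a\u00e9b")
    (list (position-bytes 1)       ; "a" starts at byte 1
          (position-bytes 2)       ; U+00E9 starts at byte 2
          (position-bytes 3)       ; "b" starts at byte 4
          (byte-to-position 2)     ; first byte of U+00E9
          (byte-to-position 3)     ; second byte of the same character
          (byte-to-position 4))))  ; "b"

(my-byte-position-demo)
     ⇒ (1 2 4 2 2 3)
```

Byte positions 2 and 3 both map back to character position 2, because
they belong to the same character.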
The following two functions are useful when a Lisp program needs to map
buffer positions to byte offsets in a file visited by the buffer.
Function: *bufferpos-to-filepos* /position &optional quality coding-system/
This function is similar to |position-bytes|, but instead of byte
position in the current buffer it returns the offset from the
beginning of the current buffer’s file of the byte that corresponds
to the given character position in the buffer. The conversion
requires knowing how the text is encoded in the buffer’s file; this
is what the coding-system argument is for, defaulting to the value
of |buffer-file-coding-system|. The optional argument quality
specifies how accurate the result should be; it should be one of the
following:
|exact|
The result must be accurate. The function may need to encode and
decode a large part of the buffer, which is expensive and can be
slow.
|approximate|
The value can be an approximation. The function may avoid
expensive processing and return an inexact result.
|nil|
If the exact result needs expensive processing, the function
will return |nil| rather than an approximation. This is the
default if the argument is omitted.
Function: *filepos-to-bufferpos* /byte &optional quality coding-system/
This function returns the buffer position corresponding to a file
position specified by byte, a zero-based byte offset from the file’s
beginning. The function performs the conversion opposite to what
|bufferpos-to-filepos| does. Optional arguments quality and
coding-system have the same meaning and values as for
|bufferpos-to-filepos|.
Function: *multibyte-string-p* /string/
Return |t| if string is a multibyte string, |nil| otherwise. This
function also returns |nil| if string is some object other than a
string.
Function: *string-bytes* /string/
This function returns the number of bytes in string. If string is a
multibyte string, this can be greater than |(length string)|.
Function: *unibyte-string* /&rest bytes/
This function concatenates all its argument bytes and makes the
result a unibyte string.
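The three string functions above can be combined into a quick
inspection helper. This is only a sketch; the name
|my-string-representation| is hypothetical:

```elisp
(defun my-string-representation (string)
  "Return a list (MULTIBYTE-P CHARS BYTES) describing STRING."
  (list (multibyte-string-p string)
        (length string)
        (string-bytes string)))

;; A string literal containing only ASCII characters is unibyte.
(my-string-representation "abc")
     ⇒ (nil 3 3)
;; One non-ASCII character, stored as two bytes.
(my-string-representation "\u00e9")
     ⇒ (t 1 2)
;; Two raw bytes, each counted as one character.
(my-string-representation (unibyte-string 195 169))
     ⇒ (nil 2 2)
```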
33.2 Disabling Multibyte Characters
By default, Emacs starts in multibyte mode: it stores the contents of
buffers and strings using an internal encoding that represents non-ASCII
characters using multi-byte sequences. Multibyte mode allows you to use
all the supported languages and scripts without limitations.
Under very special circumstances, you may want to disable multibyte
character support for a specific buffer. When multibyte characters are
disabled in a buffer, we call that /unibyte mode/. In unibyte mode, each
character in the buffer has a character code ranging from 0 through 255
(0377 octal); 0 through 127 (0177 octal) represent ASCII characters, and
128 (0200 octal) through 255 (0377 octal) represent non-ASCII characters.
To edit a particular file in unibyte representation, visit it using
|find-file-literally|. See Visiting Functions <#Visiting-Functions>. You
can convert a multibyte buffer to unibyte by saving it to a file,
killing the buffer, and visiting the file again with
|find-file-literally|. Alternatively, you can use C-x RET c
(|universal-coding-system-argument|) and specify ‘raw-text’ as the
coding system with which to visit or save a file. See Specifying a
Coding System for File Text in the GNU Emacs Manual. Unlike
|find-file-literally|, finding a file as
‘raw-text’ doesn’t disable format conversion, uncompression, or auto
mode selection.
The buffer-local variable |enable-multibyte-characters| is non-|nil| in
multibyte buffers, and |nil| in unibyte ones. The mode line also
indicates whether a buffer is multibyte or not. With a graphical
display, in a multibyte buffer, the portion of the mode line that
indicates the character set has a tooltip that (amongst other things)
says that the buffer is multibyte. In a unibyte buffer, the character
set indicator is absent. Thus, in a unibyte buffer (when using a
graphical display) there is normally nothing before the indication of
the visited file’s end-of-line convention (colon, backslash, etc.),
unless you are using an input method.
You can turn off multibyte support in a specific buffer by invoking the
command |toggle-enable-multibyte-characters| in that buffer.
33.3 Converting Text Representations
Emacs can convert unibyte text to multibyte; it can also convert
multibyte text to unibyte, provided that the multibyte text contains
only ASCII and 8-bit raw bytes. In general, these conversions happen
when inserting text into a buffer, or when putting text from several
strings together in one string. You can also explicitly convert a
string’s contents to either representation.
Emacs chooses the representation for a string based on the text from
which it is constructed. The general rule is to convert unibyte text to
multibyte text when combining it with other multibyte text, because the
multibyte representation is more general and can hold whatever
characters the unibyte text has.
When inserting text into a buffer, Emacs converts the text to the
buffer’s representation, as specified by |enable-multibyte-characters|
in that buffer. In particular, when you insert multibyte text into a
unibyte buffer, Emacs converts the text to unibyte, even though this
conversion cannot in general preserve all the characters that might be
in the multibyte text. The other natural alternative, to convert the
buffer contents to multibyte, is not acceptable because the buffer’s
representation is a choice made by the user that cannot be overridden
automatically.
Converting unibyte text to multibyte text leaves ASCII characters
unchanged, and converts bytes with codes 128 through 255 to the
multibyte representation of raw eight-bit bytes.
Converting multibyte text to unibyte converts all ASCII and eight-bit
characters to their single-byte form, but loses information for
non-ASCII characters by discarding all but the low 8 bits of each
character’s codepoint. Converting unibyte text to multibyte and back to
unibyte reproduces the original unibyte text.
The next two functions either return the argument string, or a newly
created string with no text properties.
Function: *string-to-multibyte* /string/
This function returns a multibyte string containing the same
sequence of characters as string. If string is a multibyte string,
it is returned unchanged. The function assumes that string includes
only ASCII characters and raw 8-bit bytes; the latter are converted
to their multibyte representation corresponding to the codepoints
|#x3FFF80| through |#x3FFFFF|, inclusive (see codepoints
<#Text-Representations>).
Function: *string-to-unibyte* /string/
This function returns a unibyte string containing the same sequence
of characters as string. It signals an error if string contains a
non-ASCII character that is not a raw eight-bit byte. If string is a
unibyte string, it is returned unchanged. Use this function for
string arguments that contain only ASCII and eight-bit characters.
Function: *byte-to-string* /byte/
This function returns a unibyte string containing a single byte of
character data, byte. It signals an error if byte is not an integer
between 0 and 255.
Function: *multibyte-char-to-unibyte* /char/
This converts the multibyte character char to a unibyte character,
and returns that character. If char is neither ASCII nor eight-bit,
the function returns -1.
Function: *unibyte-char-to-multibyte* /char/
This converts the unibyte character char to a multibyte character,
assuming char is either an ASCII character or a raw 8-bit byte.
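The conversion functions in this section can be exercised together.
The following is a sketch, not a definitive test suite; the function
name |my-conversion-demo| is hypothetical, and U+03B1 is used simply
as an example of a character that is neither ASCII nor a raw byte:

```elisp
(defun my-conversion-demo ()
  "Exercise the string and character conversion functions."
  (let* ((raw (unibyte-string ?a #x80))   ; "a" plus the raw byte #x80
         (multi (string-to-multibyte raw)))
    (list (multibyte-string-p multi)
          ;; The raw byte becomes an eight-bit character.
          (aref multi 1)
          ;; Converting back reproduces the original unibyte text.
          (equal (string-to-unibyte multi) raw)
          (unibyte-char-to-multibyte #x80)
          (multibyte-char-to-unibyte #x3FFF80)
          ;; U+03B1 is neither ASCII nor eight-bit.
          (multibyte-char-to-unibyte #x3b1)
          (condition-case nil
              (string-to-unibyte "\u03b1")
            (error 'error)))))

(my-conversion-demo)
     ⇒ (t 4194176 t 4194176 128 -1 error)
```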
33.4 Selecting a Representation
Sometimes it is useful to examine an existing buffer or string as
multibyte when it was unibyte, or vice versa.
Function: *set-buffer-multibyte* /multibyte/
Set the representation type of the current buffer. If multibyte is
non-|nil|, the buffer becomes multibyte. If multibyte is |nil|, the
buffer becomes unibyte.
This function leaves the buffer contents unchanged when viewed as a
sequence of bytes. As a consequence, it can change the contents
viewed as characters; for instance, a sequence of three bytes which
is treated as one character in multibyte representation will count
as three characters in unibyte representation. Eight-bit characters
representing raw bytes are an exception. They are represented by one
byte in a unibyte buffer, but when the buffer is set to multibyte,
they are converted to two-byte sequences, and vice versa.
This function sets |enable-multibyte-characters| to record which
representation is in use. It also adjusts various data in the buffer
(including overlays, text properties and markers) so that they cover
the same text as they did before.
This function signals an error if the buffer is narrowed, since the
narrowing might have occurred in the middle of multibyte character
sequences.
This function also signals an error if the buffer is an indirect
buffer. An indirect buffer always inherits the representation of its
base buffer.
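To see how the character count changes while the bytes stay the
same, consider this sketch (the name |my-reinterpret-demo| is
hypothetical). It inserts one two-byte character, switches the
buffer to unibyte, and then switches back:

```elisp
(defun my-reinterpret-demo ()
  "Return buffer sizes before and after changing the representation."
  (with-temp-buffer
    (set-buffer-multibyte t)
    (insert "\u00e9")                     ; one character, two bytes
    (let ((chars-multibyte (buffer-size)))
      (set-buffer-multibyte nil)          ; same bytes, now two characters
      (let ((chars-unibyte (buffer-size))
            (first-byte (char-after (point-min))))
        (set-buffer-multibyte t)          ; back to a single character
        (list chars-multibyte chars-unibyte first-byte (buffer-size))))))

(my-reinterpret-demo)
     ⇒ (1 2 195 1)
```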
Function: *string-as-unibyte* /string/
If string is already a unibyte string, this function returns string
itself. Otherwise, it returns a new string with the same bytes as
string, but treating each byte as a separate character (so that the
value may have more characters than string); as an exception, each
eight-bit character representing a raw byte is converted into a
single byte. The newly-created string contains no text properties.
Function: *string-as-multibyte* /string/
If string is a multibyte string, this function returns string
itself. Otherwise, it returns a new string with the same bytes as
string, but treating each multibyte sequence as one character. This
means that the value may have fewer characters than string has. If a
byte sequence in string is invalid as a multibyte representation of
a single character, each byte in the sequence is treated as a raw
8-bit byte. The newly-created string contains no text properties.
33.5 Character Codes
The unibyte and multibyte text representations use different character
codes. The valid character codes for unibyte representation range from 0
to |#xFF| (255)—the values that can fit in one byte. The valid character
codes for multibyte representation range from 0 to |#x3FFFFF|. In this
code space, values 0 through |#x7F| (127) are for ASCII characters, and
values |#x80| (128) through |#x3FFF7F| (4194175) are for non-ASCII
characters.
Emacs character codes are a superset of the Unicode standard. Values 0
through |#x10FFFF| (1114111) correspond to Unicode characters of the
same codepoint; values |#x110000| (1114112) through |#x3FFF7F| (4194175)
represent characters that are not unified with Unicode; and values
|#x3FFF80| (4194176) through |#x3FFFFF| (4194303) represent eight-bit
raw bytes.
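The ranges described above can be written down as a small
classifier. This is an illustrative sketch only; the function name
|my-char-code-class| and the symbols it returns are hypothetical and
are not part of Emacs:

```elisp
(defun my-char-code-class (code)
  "Classify CODE by the range of the Emacs code space it falls in.
Return nil if CODE is not a valid character code."
  (cond ((not (characterp code)) nil)
        ((<= code #x7F) 'ascii)
        ((<= code #x10FFFF) 'unicode)
        ((<= code #x3FFF7F) 'non-unified)
        (t 'raw-byte)))

(my-char-code-class ?A)
     ⇒ ascii
(my-char-code-class #x3b1)
     ⇒ unicode
(my-char-code-class #x110000)
     ⇒ non-unified
(my-char-code-class #x3FFF80)
     ⇒ raw-byte
(my-char-code-class #x400000)
     ⇒ nil
```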
Function: *characterp* /charcode/
This returns |t| if charcode is a valid character, and |nil| otherwise.
(characterp 65)
⇒ t
(characterp 4194303)
⇒ t
(characterp 4194304)
⇒ nil
Function: *max-char*
This function returns the largest value that a valid character
codepoint can have.
(characterp (max-char))
⇒ t
(characterp (1+ (max-char)))
⇒ nil
Function: *char-from-name* /string &optional ignore-case/
This function returns the character whose Unicode name is string. If
ignore-case is non-|nil|, case is ignored in string. This function
returns |nil| if string does not name a character.
;; U+03A3
(= (char-from-name "GREEK CAPITAL LETTER SIGMA") #x03A3)
⇒ t
Function: *get-byte* /&optional pos string/
This function returns the byte at character position pos in the
current buffer. If the current buffer is unibyte, this is literally
the byte at that position. If the buffer is multibyte, byte values
of ASCII characters are the same as character codepoints, whereas
eight-bit raw bytes are converted to their 8-bit codes. The function
signals an error if the character at pos is neither ASCII nor an
eight-bit raw byte.
The optional argument string means to get a byte value from that
string instead of the current buffer.
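Because |get-byte| signals an error for characters that have no byte
value, callers may want to trap that error. A minimal sketch, using a
hypothetical wrapper named |my-byte-or-nil|:

```elisp
(defun my-byte-or-nil (string index)
  "Return the byte at INDEX in STRING, or nil if there is none."
  (condition-case nil
      (get-byte index string)
    (error nil)))

(my-byte-or-nil "abc" 0)
     ⇒ 97
;; An eight-bit raw byte in a multibyte string.
(my-byte-or-nil (string-to-multibyte (unibyte-string #xff)) 0)
     ⇒ 255
;; U+00E9 is neither ASCII nor a raw byte.
(my-byte-or-nil "\u00e9" 0)
     ⇒ nil
```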
33.6 Character Properties
A /character property/ is a named attribute of a character that
specifies how the character behaves and how it should be handled during
text processing and display. Thus, character properties are an important
part of specifying the character’s semantics.
On the whole, Emacs follows the Unicode Standard in its implementation
of character properties. In particular, Emacs supports the Unicode
Character Property Model, and the Emacs character property database is
derived from the Unicode Character Database (UCD). See the Character
Properties chapter of the Unicode Standard for a detailed description
of Unicode character properties and their meaning.
This section assumes you are already familiar with that chapter of the
Unicode Standard, and want to apply that knowledge to Emacs Lisp programs.
In Emacs, each property has a name, which is a symbol, and a set of
possible values, whose types depend on the property; if a character does
not have a certain property, the value is |nil|. As a general rule, the
names of character properties in Emacs are produced from the
corresponding Unicode properties by downcasing them and replacing each
‘_’ character with a dash ‘-’. For example, |Canonical_Combining_Class|
becomes |canonical-combining-class|. However, sometimes we shorten the
names to make their use easier.
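The general naming rule can be sketched as follows. The helper name
|my-unicode-property-to-symbol| is hypothetical, and the sketch
covers only the general rule: it does not reproduce the shortened
names (for example, it will not turn |Bidi_Mirroring_Glyph| into
|mirroring|).

```elisp
(defun my-unicode-property-to-symbol (name)
  "Convert the Unicode property NAME to a symbol by the general rule.
Downcase NAME and replace each underscore with a dash."
  (intern (downcase (replace-regexp-in-string "_" "-" name t t))))

(my-unicode-property-to-symbol "Canonical_Combining_Class")
     ⇒ canonical-combining-class
```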
Some codepoints are left /unassigned/ by the UCD—they don’t correspond
to any character. The Unicode Standard defines default values of
properties for such codepoints; they are mentioned below for each property.
Here is the full list of value types for all the character properties
that Emacs knows about:
|name|
Corresponds to the |Name| Unicode property. The value is a string
consisting of upper-case Latin letters A to Z, digits, spaces, and
hyphen ‘-’ characters. For unassigned codepoints, the value is |nil|.
|general-category|
Corresponds to the |General_Category| Unicode property. The value is
a symbol whose name is a 2-letter abbreviation of the character’s
classification. For unassigned codepoints, the value is |Cn|.
|canonical-combining-class|
Corresponds to the |Canonical_Combining_Class| Unicode property. The
value is an integer. For unassigned codepoints, the value is zero.
|bidi-class|
Corresponds to the Unicode |Bidi_Class| property. The value is a
symbol whose name is the Unicode /directional type/ of the
character. Emacs uses this property when it reorders bidirectional
text for display (see Bidirectional Display
<#Bidirectional-Display>). For unassigned codepoints, the value
depends on the code blocks to which the codepoint belongs: most
unassigned codepoints get the value of |L| (strong L), but some get
values of |AL| (Arabic letter) or |R| (strong R).
|decomposition|
Corresponds to the Unicode properties |Decomposition_Type| and
|Decomposition_Value|. The value is a list, whose first element may
be a symbol representing a compatibility formatting tag, such as
|small|^18 <#FOOT18>; the other elements are characters that give
the compatibility decomposition sequence of this character. For
characters that don’t have decomposition sequences, and for
unassigned codepoints, the value is a list with a single member, the
character itself.
|decimal-digit-value|
Corresponds to the Unicode |Numeric_Value| property for characters
whose |Numeric_Type| is ‘Decimal’. The value is an integer, or |nil|
if the character has no decimal digit value. For unassigned
codepoints, the value is |nil|, which means NaN, or “not a number”.
|digit-value|
Corresponds to the Unicode |Numeric_Value| property for characters
whose |Numeric_Type| is ‘Digit’. The value is an integer. Examples
of such characters include compatibility subscript and superscript
digits, for which the value is the corresponding number. For
characters that don’t have any numeric value, and for unassigned
codepoints, the value is |nil|, which means NaN.
|numeric-value|
Corresponds to the Unicode |Numeric_Value| property for characters
whose |Numeric_Type| is ‘Numeric’. The value of this property is a
number. Examples of characters that have this property include
fractions, subscripts, superscripts, Roman numerals, currency
numerators, and encircled numbers. For example, the value of this
property for the character U+2155 VULGAR FRACTION ONE FIFTH is
|0.2|. For characters that don’t have any numeric value, and for
unassigned codepoints, the value is |nil|, which means NaN.
|mirrored|
Corresponds to the Unicode |Bidi_Mirrored| property. The value of
this property is a symbol, either |Y| or |N|. For unassigned
codepoints, the value is |N|.
|mirroring|
Corresponds to the Unicode |Bidi_Mirroring_Glyph| property. The
value of this property is a character whose glyph represents the
mirror image of the character’s glyph, or |nil| if there’s no
defined mirroring glyph. All the characters whose |mirrored|
property is |N| have |nil| as their |mirroring| property; however,
some characters whose |mirrored| property is |Y| also have |nil| for
|mirroring|, because no appropriate characters exist with mirrored
glyphs. Emacs uses this property to display mirror images of
characters when appropriate (see Bidirectional Display
<#Bidirectional-Display>). For unassigned codepoints, the value is
|nil|.
|paired-bracket|
Corresponds to the Unicode |Bidi_Paired_Bracket| property. The value
of this property is the codepoint of a character’s /paired bracket/,
or |nil| if the character is not a bracket character. This
establishes a mapping between characters that are treated as bracket
pairs by the Unicode Bidirectional Algorithm; Emacs uses this
property when it decides how to reorder parentheses, braces, and
other similar characters for display (see Bidirectional Display
<#Bidirectional-Display>).
|bracket-type|
Corresponds to the Unicode |Bidi_Paired_Bracket_Type| property. For
characters whose |paired-bracket| property is non-|nil|, the value
of this property is a symbol, either |o| (for opening bracket
characters) or |c| (for closing bracket characters). For characters
whose |paired-bracket| property is |nil|, the value is the symbol
|n| (None). Like |paired-bracket|, this property is used for
bidirectional display.
|old-name|
Corresponds to the Unicode |Unicode_1_Name| property. The value is a
string. For unassigned codepoints, and characters that have no value
for this property, the value is |nil|.
|iso-10646-comment|
Corresponds to the Unicode |ISO_Comment| property. The value is
either a string or |nil|. For unassigned codepoints, the value is
|nil|.
|uppercase|
Corresponds to the Unicode |Simple_Uppercase_Mapping| property. The
value of this property is a single character. For unassigned
codepoints, the value is |nil|, which means the character itself.
|lowercase|
Corresponds to the Unicode |Simple_Lowercase_Mapping| property. The
value of this property is a single character. For unassigned
codepoints, the value is |nil|, which means the character itself.
|titlecase|
Corresponds to the Unicode |Simple_Titlecase_Mapping| property.
/Title case/ is a special form of a character used when the first
character of a word needs to be capitalized. The value of this
property is a single character. For unassigned codepoints, the value
is |nil|, which means the character itself.
|special-uppercase|
Corresponds to Unicode language- and context-independent special
upper-casing rules. The value of this property is a string (which
may be empty). For example, the mapping for U+00DF LATIN SMALL LETTER
SHARP S is |"SS"|. For characters with no special mapping, the value
is |nil|, which means the |uppercase| property needs to be consulted
instead.
|special-lowercase|
Corresponds to Unicode language- and context-independent special
lower-casing rules. The value of this property is a string (which
may be empty). For example, for U+0130 LATIN CAPITAL LETTER I
WITH DOT ABOVE the value is |"i\u0307"| (i.e., a 2-character string
consisting of LATIN SMALL LETTER I followed by U+0307 COMBINING DOT
ABOVE). For characters with no special mapping, the value is |nil|,
which means the |lowercase| property needs to be consulted instead.
|special-titlecase|
Corresponds to Unicode unconditional special title-casing rules. The
value of this property is a string (which may be empty). For example,
for U+FB01 LATIN SMALL LIGATURE FI the value is |"Fi"|. For
characters with no special mapping, the value is |nil|, which means
the |titlecase| property needs to be consulted instead.
Function: *get-char-code-property* /char propname/
This function returns the value of char’s propname property.
(get-char-code-property ?\s 'general-category)
⇒ Zs
(get-char-code-property ?1 'general-category)
⇒ Nd
;; U+2084
(get-char-code-property ?\N{SUBSCRIPT FOUR}
'digit-value)
⇒ 4
;; U+2155
(get-char-code-property ?\N{VULGAR FRACTION ONE FIFTH}
'numeric-value)
⇒ 0.2
;; U+2163
(get-char-code-property ?\N{ROMAN NUMERAL FOUR}
'numeric-value)
⇒ 4
(get-char-code-property ?\( 'paired-bracket)
⇒ 41 ;; closing parenthesis
(get-char-code-property ?\) 'bracket-type)
⇒ c
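The fallback from the special-casing properties to the simple ones
can be sketched like this. The helper |my-char-uppercase-string| is
hypothetical and ignores language- and context-dependent rules; it
only shows the order in which the two properties are consulted:

```elisp
(defun my-char-uppercase-string (char)
  "Return the upper-case form of CHAR as a string.
Consult `special-uppercase' first, then `uppercase'; if both
are nil, the character maps to itself."
  (or (get-char-code-property char 'special-uppercase)
      (string (or (get-char-code-property char 'uppercase)
                  char))))

;; U+00DF LATIN SMALL LETTER SHARP S has a special mapping.
(my-char-uppercase-string #xdf)
     ⇒ "SS"
(my-char-uppercase-string ?a)
     ⇒ "A"
(my-char-uppercase-string ?1)
     ⇒ "1"
```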
Function: *char-code-property-description* /prop value/
This function returns the description string of property prop’s
value, or |nil| if value has no description.
(char-code-property-description 'general-category 'Zs)
⇒ "Separator, Space"
(char-code-property-description 'general-category 'Nd)
⇒ "Number, Decimal Digit"
(char-code-property-description 'numeric-value '1/5)
⇒ nil
Function: *put-char-code-property* /char propname value/
This function stores value as the value of the property propname for
the character char.
Variable: *unicode-category-table*
The value of this variable is a char-table (see Char-Tables
<#Char_002dTables>) that specifies, for each character, its Unicode
|General_Category| property as a symbol.
Variable: *char-script-table*
The value of this variable is a char-table that specifies, for each
character, a symbol whose name is the script to which the character
belongs, according to the Unicode Standard classification of the
Unicode code space into script-specific blocks. This char-table has
a single extra slot whose value is the list of all script symbols.
Variable: *char-width-table*
The value of this variable is a char-table that specifies the width
of each character in columns that it will occupy on the screen.
Variable: *printable-chars*
The value of this variable is a char-table that specifies, for each
character, whether it is printable or not. That is, if evaluating
|(aref printable-chars char)| results in |t|, the character is
printable, and if it results in |nil|, it is not.
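Since these variables are char-tables, they can be queried with
|aref|. The following sketch (with a hypothetical helper named
|my-char-table-summary|) collects three of them for one character;
the widths shown assume the default |char-width-table|:

```elisp
(defun my-char-table-summary (char)
  "Return a list (SCRIPT WIDTH PRINTABLE-P) for CHAR."
  (list (aref char-script-table char)
        (aref char-width-table char)
        (aref printable-chars char)))

(my-char-table-summary ?a)
     ⇒ (latin 1 t)
;; U+4E2D, a CJK ideograph that occupies two columns.
(my-char-table-summary #x4e2d)
     ⇒ (han 2 t)
```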
33.7 Character Sets
An Emacs /character set/, or /charset/, is a set of characters in which
each character is assigned a numeric code point. (The Unicode Standard
calls this a /coded character set/.) Each Emacs charset has a name which
is a symbol. A single character can belong to any number of different
character sets, but it will generally have a different code point in
each charset. Examples of character sets include |ascii|, |iso-8859-1|,
|greek-iso8859-7|, and |windows-1255|. The code point assigned to a
character in a charset is usually different from its code point used in
Emacs buffers and strings.
Emacs defines several special character sets. The character set
|unicode| includes all the characters whose Emacs code points are in the
range |0..#x10FFFF|. The character set |emacs| includes all ASCII and
non-ASCII characters. Finally, the |eight-bit| charset includes the
8-bit raw bytes; Emacs uses it to represent raw bytes encountered in text.
Function: *charsetp* /object/
Returns |t| if object is a symbol that names a character set, |nil|
otherwise.
Variable: *charset-list*
The value is a list of all defined character set names.
Function: *charset-priority-list* /&optional highestp/
This function returns a list of all defined character sets ordered
by their priority. If highestp is non-|nil|, the function returns a
single character set of the highest priority.
Function: *set-charset-priority* /&rest charsets/
This function makes charsets the highest priority character sets.
Function: *char-charset* /character &optional restriction/
This function returns the name of the character set of highest
priority that character belongs to. ASCII characters are an
exception: for them, this function always returns |ascii|.
If restriction is non-|nil|, it should be a list of charsets to
search. Alternatively, it can be a coding system, in which case the
returned charset must be supported by that coding system (see Coding
Systems <#Coding-Systems>).
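A short sketch of |charsetp| and |char-charset|, including the
restriction argument. The function name |my-charset-demo| is
hypothetical; U+03B1 GREEK SMALL LETTER ALPHA is used because it
belongs to |iso-8859-7| but not to |iso-8859-1|:

```elisp
(defun my-charset-demo ()
  "Return the results of a few charset queries."
  (list (charsetp 'iso-8859-7)
        (charsetp 'no-such-charset)
        ;; ASCII characters always report `ascii'.
        (char-charset ?a)
        ;; Search only the listed charsets.
        (char-charset #x3b1 '(iso-8859-7))
        (char-charset #x3b1 '(iso-8859-1))))

(my-charset-demo)
     ⇒ (t nil ascii iso-8859-7 nil)
```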
Function: *charset-plist* /charset/
This function returns the property list of the character set
charset. Although charset is a symbol, this is not the same as the
property list of that symbol. Charset properties include important
information about the charset, such as its documentation string,
short name, etc.
Function: *put-charset-property* /charset propname value/
This function sets the propname property of charset to the given value.
Function: *get-charset-property* /charset propname/
This function returns the value of charset’s property propname.
Command: *list-charset-chars* /charset/
This command displays a list of characters in the character set
charset.
Emacs can convert between its internal representation of a character and
the character’s codepoint in a specific charset. The following two
functions support these conversions.
Function: *decode-char* /charset code-point/
This function decodes a character that is assigned a code-point in
charset, to the corresponding Emacs character, and returns it. If
charset doesn’t contain a character of that code point, the value is
|nil|.
For backward compatibility, if code-point doesn’t fit in a Lisp
fixnum (see most-positive-fixnum <#Integer-Basics>), it can be
specified as a cons cell |(high . low)|, where low is the lower 16
bits of the value and high is the upper 16 bits. This usage is
obsolescent.
Function: *encode-char* /char charset/
This function returns the code point assigned to the character char
in charset. If charset doesn’t have a codepoint for char, the value
is |nil|.
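As an illustration, the two functions can be chained to move a code
point from one charset to another. This is a sketch under the
assumption that both charsets are defined in the running Emacs; the
helper name |my-recode-code-point| is hypothetical:

```elisp
(defun my-recode-code-point (code from-charset to-charset)
  "Map CODE in FROM-CHARSET to its code point in TO-CHARSET.
Return nil if either charset lacks the character."
  (let ((char (decode-char from-charset code)))
    (and char (encode-char char to-charset))))

;; #xE1 in ISO 8859-7 is GREEK SMALL LETTER ALPHA, U+03B1.
(decode-char 'iso-8859-7 #xe1)
     ⇒ 945
(encode-char #x3b1 'iso-8859-7)
     ⇒ 225
(my-recode-code-point #xe1 'iso-8859-7 'unicode)
     ⇒ 945
;; ISO 8859-1 has no Greek letters.
(my-recode-code-point #xe1 'iso-8859-7 'iso-8859-1)
     ⇒ nil
```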
The following function comes in handy for applying a certain function to
all or part of the characters in a charset:
Function: *map-charset-chars* /function charset &optional arg from-code
to-code/
Call function for characters in charset. function is called with two
arguments. The first one is a cons cell |(from . to)|, where from
and to indicate a range of characters contained in charset. The
second argument passed to function is arg.
By default, the range of codepoints passed to function includes all
the characters in charset, but optional arguments from-code and
to-code limit that to the range of characters between these two
codepoints of charset. If either of them is |nil|, it defaults to
the first or last codepoint of charset, respectively.
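For example, the ranges passed to function can be summed to count
the characters in a charset. This is a minimal sketch; the name
|my-charset-char-count| is hypothetical, and the sketch does not rely
on how many separate ranges function receives:

```elisp
(defun my-charset-char-count (charset &optional from-code to-code)
  "Count the characters of CHARSET between FROM-CODE and TO-CODE."
  (let ((count 0))
    (map-charset-chars
     (lambda (range _arg)
       ;; RANGE is a cons cell (FROM . TO), both ends inclusive.
       (setq count (+ count 1 (- (cdr range) (car range)))))
     charset nil from-code to-code)
    count))

(my-charset-char-count 'ascii)
     ⇒ 128
(my-charset-char-count 'ascii ?a ?z)
     ⇒ 26
```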
33.8 Scanning for Character Sets
Sometimes it is useful to find out which character set a particular
character belongs to. One use for this is in determining which coding
systems (see Coding Systems <#Coding-Systems>) are capable of
representing all of the text in question; another is to determine the
font(s) for displaying that text.
Function: *charset-after* /&optional pos/
This function returns the charset of highest priority containing the
character at position pos in the current buffer. If pos is omitted
or |nil|, it defaults to the current value of point. If pos is out
of range, the value is |nil|.
Function: *find-charset-region* /beg end &optional translation/
This function returns a list of the character sets of highest
priority that contain characters in the current buffer between
positions beg and end.
The optional argument translation specifies a translation table to
use for scanning the text (see Translation of Characters
<#Translation-of-Characters>). If it is non-|nil|, then each
character in the region is translated through this table, and the
value returned describes the translated characters instead of the
characters actually in the buffer.
Function: *find-charset-string* /string &optional translation/
This function returns a list of character sets of highest priority
that contain characters in string. It is just like
|find-charset-region|, except that it applies to the contents of
string instead of part of the current buffer.
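One simple use of |find-charset-string| is to check whether a string
is pure ASCII. The predicate name |my-ascii-only-p| below is
hypothetical, and the sketch treats the empty string, for which no
charsets are reported, as not ASCII-only:

```elisp
(defun my-ascii-only-p (string)
  "Return non-nil if every character of STRING is in `ascii'."
  (equal (find-charset-string string) '(ascii)))

(find-charset-string "abc")
     ⇒ (ascii)
(my-ascii-only-p "abc")
     ⇒ t
;; "a" followed by U+03B1.
(my-ascii-only-p "a\u03b1")
     ⇒ nil
```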
33.9 Translation of Characters
A /translation table/ is a char-table (see Char-Tables
<#Char_002dTables>) that specifies a mapping of characters into
characters. These tables are used in encoding and decoding, and for
other purposes. Some coding systems specify their own particular
translation tables; there are also default translation tables which
apply to all other coding systems.
A translation table has two extra slots. The first is either |nil| or a
translation table that performs the reverse translation; the second is
the maximum number of characters to look up for translating sequences of
characters (see the description of |make-translation-table-from-alist|
below).
Function: *make-translation-table* /&rest translations/
This function returns a translation table based on the argument
translations. Each element of translations should be a list of
elements of the form |(from . to)|; this says to translate the
character from into to.
The arguments and the forms in each argument are processed in order,
and if a previous form already translates the character to into some
other character, say to-alt, then from is also translated to to-alt.
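The chaining rule can be seen in a small sketch (the function name
|my-translation-demo| is hypothetical). The first argument maps ‘b’
to ‘c’; the second maps ‘a’ to ‘b’, which therefore ends up mapped to
‘c’ as well. Characters with no translation yield |nil|:

```elisp
(defun my-translation-demo ()
  "Build a translation table and look up three characters in it."
  (let ((table (make-translation-table '((?b . ?c))
                                       '((?a . ?b)))))
    (list (aref table ?b)     ; b is translated to c
          (aref table ?a)     ; a follows the chain and also becomes c
          (aref table ?z))))  ; z has no translation

(my-translation-demo)
     ⇒ (99 99 nil)
```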
During decoding, the translation table’s translations are applied to the
characters that result from ordinary decoding. If a coding system has
the property |:decode-translation-table|, that specifies the translation
table to use, or a list of translation tables to apply in sequence.
(This is a property of the coding system, as returned by
|coding-system-get|, not a property of the symbol that is the coding
system’s name. See Basic Concepts of Coding Systems
<#Coding-System-Basics>.) Finally, if
|standard-translation-table-for-decode| is non-|nil|, the resulting
characters are translated by that table.
During encoding, the translation table’s translations are applied to the
characters in the buffer, and the result of translation is actually
encoded. If a coding system has property |:encode-translation-table|,
that specifies the translation table to use, or a list of translation
tables to apply in sequence. In addition, if the variable
|standard-translation-table-for-encode| is non-|nil|, it specifies the
translation table to use for translating the result.
Variable: *standard-translation-table-for-decode*
This is the default translation table for decoding. If a coding
system specifies its own translation tables, the table that is the
value of this variable, if non-|nil|, is applied after them.
Variable: *standard-translation-table-for-encode*
This is the default translation table for encoding. If a coding
system specifies its own translation tables, the table that is the
value of this variable, if non-|nil|, is applied after them.
Variable: *translation-table-for-input*
Self-inserting characters are translated through this translation
table before they are inserted. Search commands also translate their
input through this table, so they can compare more reliably with
what’s in the buffer.
This variable automatically becomes buffer-local when set.
Function: *make-translation-table-from-vector* /vec/
This function returns a translation table made from vec, which is an
array of 256 elements that map bytes (values 0 through #xFF) to
characters. Elements may be |nil| for untranslated bytes. The
returned table has a translation table for reverse mapping in the
first extra slot, and the value |1| in the second extra slot.
This function provides an easy way to make a private coding system
that maps each byte to a specific character. You can specify the
returned table and the reverse translation table using the
properties |:decode-translation-table| and
|:encode-translation-table| respectively in the props argument to
|define-coding-system|.
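The following sketch, which assumes an arbitrary illustrative mapping of byte #x80 to the euro sign, builds such a table and inspects both extra slots:

```elisp
;; Start from an identity mapping of all 256 byte values, then
;; remap byte #x80 to the euro sign (an arbitrary illustrative choice).
(let ((vec (make-vector 256 0)))
  (dotimes (i 256)
    (aset vec i i))
  (aset vec #x80 #x20AC)
  (let* ((table (make-translation-table-from-vector vec))
         (reverse (char-table-extra-slot table 0)))
    (list (aref table #x80)                    ; byte -> character
          (aref reverse #x20AC)                ; character -> byte
          (char-table-extra-slot table 1))))   ; second extra slot
;; ⇒ (8364 128 1)
```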
Function: *make-translation-table-from-alist* /alist/
This function is similar to |make-translation-table| but returns a
complex translation table rather than a simple one-to-one mapping.
Each element of alist is of the form |(from . to)|, where from and
to are either characters or vectors specifying a sequence of
characters. If from is a character, that character is translated to
to (i.e., to a character or a character sequence). If from is a
vector of characters, that sequence is translated to to. The
returned table has a translation table for reverse mapping in the
first extra slot, and the maximum length of all the from character
sequences in the second extra slot.
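A small sketch of the two kinds of from element; the particular characters are arbitrary:

```elisp
(let ((table (make-translation-table-from-alist
              '((?a . ?b)            ; single character to character
                ([?x ?y] . ?z)))))   ; two-character sequence to character
  (list (aref table ?a)
        ;; The longest "from" sequence is [?x ?y], so this is 2.
        (char-table-extra-slot table 1)
        ;; The first extra slot holds the reverse table.
        (char-table-p (char-table-extra-slot table 0))))
;; ⇒ (98 2 t)
```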
33.10 Coding Systems
When Emacs reads or writes a file, and when Emacs sends text to a
subprocess or receives text from a subprocess, it normally performs
character code conversion and end-of-line conversion as specified by a
particular /coding system/.
How to define a coding system is an arcane matter, and is not documented
here.
• Coding System Basics <#Coding-System-Basics> Basic concepts.
• Encoding and I/O <#Encoding-and-I_002fO> How file I/O functions
handle coding systems.
• Lisp and Coding Systems <#Lisp-and-Coding-Systems> Functions to
operate on coding system names.
• User-Chosen Coding Systems <#User_002dChosen-Coding-Systems> Asking
the user to choose a coding system.
• Default Coding Systems <#Default-Coding-Systems> Controlling the
default choices.
• Specifying Coding Systems <#Specifying-Coding-Systems> Requesting a
particular coding system for a single file operation.
• Explicit Encoding <#Explicit-Encoding> Encoding or decoding text
without doing I/O.
• Terminal I/O Encoding <#Terminal-I_002fO-Encoding> Use of encoding
for terminal I/O.
33.10.1 Basic Concepts of Coding Systems
/Character code conversion/ involves conversion between the internal
representation of characters used inside Emacs and some other encoding.
Emacs supports many different encodings, in that it can convert to and
from them. For example, it can convert text to or from encodings such as
Latin 1, Latin 2, Latin 3, Latin 4, Latin 5, and several variants of ISO
2022. In some cases, Emacs supports several alternative encodings for
the same characters; for example, there are three coding systems for the
Cyrillic (Russian) alphabet: ISO, Alternativnyj, and KOI8.
Every coding system specifies a particular set of character code
conversions, but the coding system |undecided| is special: it leaves the
choice unspecified, to be chosen heuristically for each file, based on
the file’s data. The coding system |prefer-utf-8| is like |undecided|,
but it prefers to choose |utf-8| when possible.
In general, a coding system doesn’t guarantee roundtrip identity:
decoding a byte sequence using a coding system, then encoding the
resulting text in the same coding system, can produce a different byte
sequence. But some coding systems do guarantee that the byte sequence
will be the same as what you originally decoded. Here are a few examples:
iso-8859-1, utf-8, big5, shift_jis, euc-jp
Encoding buffer text and then decoding the result can also fail to
reproduce the original text. For instance, if you encode a character
with a coding system which does not support that character, the result
is unpredictable, and thus decoding it using the same coding system may
produce a different text. Currently, Emacs can’t report errors that
result from encoding unsupported characters.
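To illustrate the roundtrip property for |utf-8|, here is a sketch using |decode-coding-string| and |encode-coding-string| (see Explicit Encoding <#Explicit-Encoding>); the sample bytes are arbitrary:

```elisp
(let* ((bytes "caf\303\251")                       ; UTF-8 bytes for "café"
       (text (decode-coding-string bytes 'utf-8))  ; multibyte string
       (again (encode-coding-string text 'utf-8))) ; back to bytes
  (list (length bytes) (length text) (string= bytes again)))
;; ⇒ (5 4 t)
```

The five bytes decode to four characters, and encoding those characters again reproduces the original byte sequence.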
/End of line conversion/ handles three different conventions used on
various systems for representing end of line in files. The Unix
convention, used on GNU and Unix systems, is to use the linefeed
character (also called newline). The DOS convention, used on MS-Windows
and MS-DOS systems, is to use a carriage return and a linefeed at the
end of a line. The Mac convention is to use just carriage return. (This
was the convention used in Classic Mac OS.)
/Base coding systems/ such as |latin-1| leave the end-of-line conversion
unspecified, to be chosen based on the data. /Variant coding systems/
such as |latin-1-unix|, |latin-1-dos| and |latin-1-mac| specify the
end-of-line conversion explicitly as well. Most base coding systems have
three corresponding variants whose names are formed by adding ‘-unix’,
‘-dos’ and ‘-mac’.
The coding system |raw-text| is special in that it prevents character
code conversion, and causes the buffer visited with this coding system
to be a unibyte buffer. For historical reasons, you can save both
unibyte and multibyte text with this coding system. When you use
|raw-text| to encode multibyte text, it does perform one character code
conversion: it converts eight-bit characters to their single-byte
external representation. |raw-text| does not specify the end-of-line
conversion, allowing that to be determined as usual by the data, and has
the usual three variants which specify the end-of-line conversion.
|no-conversion| (and its alias |binary|) is equivalent to
|raw-text-unix|: it specifies no conversion of either character codes or
end-of-line.
The coding system |utf-8-emacs| specifies that the data is represented
in the internal Emacs encoding (see Text Representations
<#Text-Representations>). This is like |raw-text| in that no code
conversion happens, but different in that the result is multibyte data.
The name |emacs-internal| is an alias for |utf-8-emacs-unix| (so it
forces no conversion of end-of-line, unlike |utf-8-emacs|, which can
decode all 3 kinds of end-of-line conventions).
Function: *coding-system-get* /coding-system property/
This function returns the specified property of the coding system
coding-system. Most coding system properties exist for internal
purposes, but one that you might find useful is |:mime-charset|.
That property’s value is the name used in MIME for the character
coding which this coding system can read and write. Examples:
(coding-system-get 'iso-latin-1 :mime-charset)
⇒ iso-8859-1
(coding-system-get 'iso-2022-cn :mime-charset)
⇒ iso-2022-cn
(coding-system-get 'cyrillic-koi8 :mime-charset)
⇒ koi8-r
The value of the |:mime-charset| property is also defined as an
alias for the coding system.
Function: *coding-system-aliases* /coding-system/
This function returns the list of aliases of coding-system.
33.10.2 Encoding and I/O
The principal purpose of coding systems is for use in reading and
writing files. The function |insert-file-contents| uses a coding system
to decode the file data, and |write-region| uses one to encode the
buffer contents.
You can specify the coding system to use either explicitly (see
Specifying Coding Systems <#Specifying-Coding-Systems>), or implicitly
using a default mechanism (see Default Coding Systems
<#Default-Coding-Systems>). But these methods may not completely specify
what to do. For example, they may choose a coding system such as
|undecided| which leaves the character code conversion to be determined
from the data. In these cases, the I/O operation finishes the job of
choosing a coding system. Very often you will want to find out
afterwards which coding system was chosen.
Variable: *buffer-file-coding-system*
This buffer-local variable records the coding system used for saving
the buffer and for writing part of the buffer with |write-region|.
If the text to be written cannot be safely encoded using the coding
system specified by this variable, these operations select an
alternative encoding by calling the function
|select-safe-coding-system| (see User-Chosen Coding Systems
<#User_002dChosen-Coding-Systems>). If selecting a different
encoding requires asking the user to specify a coding system,
|buffer-file-coding-system| is updated to the newly selected coding
system.
|buffer-file-coding-system| does /not/ affect sending text to a
subprocess.
Variable: *save-buffer-coding-system*
This variable specifies the coding system for saving the buffer (by
overriding |buffer-file-coding-system|). Note that it is not used
for |write-region|.
When a command to save the buffer starts out to use
|buffer-file-coding-system| (or |save-buffer-coding-system|), and
that coding system cannot handle the actual text in the buffer, the
command asks the user to choose another coding system (by calling
|select-safe-coding-system|). After that happens, the command also
updates |buffer-file-coding-system| to represent the coding system
that the user specified.
Variable: *last-coding-system-used*
I/O operations for files and subprocesses set this variable to the
coding system name that was used. The explicit encoding and decoding
functions (see Explicit Encoding <#Explicit-Encoding>) set it too.
*Warning:* Since receiving subprocess output sets this variable, it
can change whenever Emacs waits; therefore, you should copy the
value shortly after the function call that stores the value you are
interested in.
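One way to follow the advice above is to bind the value in the same |let*| form as the call that sets it. This is only a sketch; the exact symbol stored may include an end-of-line variant:

```elisp
(let* ((text (decode-coding-string "abc" 'utf-8))
       ;; Copy the value right away, before Emacs has a chance to wait
       ;; and process subprocess output that would overwrite it.
       (used last-coding-system-used))
  (list text used))
```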
The variable |selection-coding-system| specifies how to encode
selections for the window system. See Window System Selections
<#Window-System-Selections>.
Variable: *file-name-coding-system*
The variable |file-name-coding-system| specifies the coding system
to use for encoding file names. Emacs encodes file names using that
coding system for all file operations. If |file-name-coding-system|
is |nil|, Emacs uses a default coding system determined by the
selected language environment. In the default language environment,
any non-ASCII characters in file names are not encoded specially;
they appear in the file system using the internal Emacs representation.
*Warning:* if you change |file-name-coding-system| (or the language
environment) in the middle of an Emacs session, problems can result if
you have already visited files whose names were encoded using the
earlier coding system and are handled differently under the new coding
system. If you try to save one of these buffers under the visited file
name, saving may use the wrong file name, or it may get an error. If
such a problem happens, use C-x C-w to specify a new file name for that
buffer.
On Windows 2000 and later, Emacs by default uses Unicode APIs to pass
file names to the OS, so the value of |file-name-coding-system| is
largely ignored. Lisp applications that need to encode or decode file
names on the Lisp level should use the |utf-8| coding system when
|system-type| is |windows-nt|; the conversion of UTF-8 encoded file
names to the encoding appropriate for communicating with the OS is
performed internally by Emacs.
33.10.3 Coding Systems in Lisp
Here are the Lisp facilities for working with coding systems:
Function: *coding-system-list* /&optional base-only/
This function returns a list of all coding system names (symbols).
If base-only is non-|nil|, the value includes only the base coding
systems. Otherwise, it includes alias and variant coding systems as
well.
Function: *coding-system-p* /object/
This function returns |t| if object is either a coding system name or |nil|.
Function: *check-coding-system* /coding-system/
This function checks the validity of coding-system. If that is
valid, it returns coding-system. If coding-system is |nil|, the
function returns |nil|. For any other value, it signals an error
whose |error-symbol| is |coding-system-error| (see signal
<#Signaling-Errors>).
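A brief sketch contrasting the two validity checks; |no-such-coding| is a hypothetical name assumed not to be defined as a coding system:

```elisp
(list (coding-system-p 'utf-8)              ; a coding system name
      (coding-system-p nil)                 ; nil also counts
      (coding-system-p 'no-such-coding)     ; hypothetical invalid name
      (check-coding-system 'utf-8)          ; valid: returned unchanged
      (condition-case nil
          (check-coding-system 'no-such-coding)
        (coding-system-error 'invalid)))    ; invalid: error is signaled
;; ⇒ (t t nil utf-8 invalid)
```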
Function: *coding-system-eol-type* /coding-system/
This function returns the type of end-of-line (a.k.a. /eol/)
conversion used by coding-system. If coding-system specifies a
certain eol conversion, the return value is an integer 0, 1, or 2,
standing for |unix|, |dos|, and |mac|, respectively. If
coding-system doesn’t specify eol conversion explicitly, the return
value is a vector of coding systems, each one with one of the
possible eol conversion types, like this:
(coding-system-eol-type 'latin-1)
⇒ [latin-1-unix latin-1-dos latin-1-mac]
If this function returns a vector, Emacs will decide, as part of the
text encoding or decoding process, what eol conversion to use. For
decoding, the end-of-line format of the text is auto-detected, and
the eol conversion is set to match it (e.g., DOS-style CRLF format
will imply |dos| eol conversion). For encoding, the eol conversion
is taken from the appropriate default coding system (e.g., default
value of |buffer-file-coding-system| for
|buffer-file-coding-system|), or from the default eol conversion
appropriate for the underlying platform.
Function: *coding-system-change-eol-conversion* /coding-system eol-type/
This function returns a coding system which is like coding-system
except for its eol conversion, which is specified by |eol-type|.
eol-type should be |unix|, |dos|, |mac|, or |nil|. If it is |nil|,
the returned coding system determines the end-of-line conversion
from the data.
eol-type may also be 0, 1 or 2, standing for |unix|, |dos| and
|mac|, respectively.
Function: *coding-system-change-text-conversion* /eol-coding text-coding/
This function returns a coding system which uses the end-of-line
conversion of eol-coding, and the text conversion of text-coding. If
text-coding is |nil|, it returns |undecided|, or one of its variants
according to eol-coding.
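The following sketch combines the two conversion-changing functions with |coding-system-eol-type| to confirm the result; the choice of coding systems is arbitrary:

```elisp
(let ((dos (coding-system-change-eol-conversion 'utf-8 'dos))
      (mixed (coding-system-change-text-conversion 'utf-8-dos 'iso-latin-1)))
  (list (coding-system-eol-type dos)      ; 1 stands for dos
        (coding-system-eol-type mixed)    ; eol conversion taken from utf-8-dos
        ;; A base coding system leaves eol conversion unspecified.
        (vectorp (coding-system-eol-type 'utf-8))))
;; ⇒ (1 1 t)
```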
Function: *find-coding-systems-region* /from to/
This function returns a list of coding systems that could be used to
encode a text between from and to. All coding systems in the list
can safely encode any multibyte characters in that portion of the text.
If the text contains no multibyte characters, the function returns
the list |(undecided)|.
Function: *find-coding-systems-string* /string/
This function returns a list of coding systems that could be used to
encode the text of string. All coding systems in the list can safely
encode any multibyte characters in string. If the text contains no
multibyte characters, this returns the list |(undecided)|.
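For instance, a sketch with one ASCII-only string and one string containing a Greek letter:

```elisp
(list (find-coding-systems-string "plain ASCII")
      ;; Non-nil when utf-8 is among the systems that can encode "λ".
      (and (memq 'utf-8 (find-coding-systems-string "λ")) t))
;; ⇒ ((undecided) t)
```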
Function: *find-coding-systems-for-charsets* /charsets/
This function returns a list of coding systems that could be used to
encode all the character sets in the list charsets.
Function: *check-coding-systems-region* /start end coding-system-list/
This function checks whether coding systems in the list
|coding-system-list| can encode all the characters in the region
between start and end. If all of the coding systems in the list can
encode the specified text, the function returns |nil|. If some
coding systems cannot encode some of the characters, the value is an
alist, each element of which has the form |(coding-system1 pos1 pos2
…)|, meaning that coding-system1 cannot encode characters at buffer
positions pos1, pos2, ....
start may be a string, in which case end is ignored and the returned
value references string indices instead of buffer positions.
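A sketch of the string form of this function. Going by the description above, |utf-8| should have no entry in the result, while |iso-latin-1| should have one that names the index of the Greek letter:

```elisp
(let ((problems (check-coding-systems-region
                 "aλ" nil '(utf-8 iso-latin-1))))
  (list (assq 'utf-8 problems)         ; nil: utf-8 encodes everything
        (assq 'iso-latin-1 problems))) ; entry listing unencodable indices
```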
Function: *detect-coding-region* /start end &optional highest/
This function chooses a plausible coding system for decoding the
text from start to end. This text should be a byte sequence, i.e.,
unibyte text or multibyte text with only ASCII and eight-bit
characters (see Explicit Encoding <#Explicit-Encoding>).
Normally this function returns a list of coding systems that could
handle decoding the text that was scanned. They are listed in order
of decreasing priority. But if highest is non-|nil|, then the return
value is just one coding system, the one that is highest in priority.
If the region contains only ASCII characters, apart from ISO-2022
control characters such as |ESC|, the value is
|undecided| or |(undecided)|, or a variant specifying end-of-line
conversion, if that can be deduced from the text.
If the region contains null bytes, the value is |no-conversion|,
even if the region contains text encoded in some coding system.
Function: *detect-coding-string* /string &optional highest/
This function is like |detect-coding-region| except that it operates
on the contents of string instead of bytes in the buffer.
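A sketch of the cases described above, using unibyte strings as input; the sample strings are arbitrary, and the middle result assumes null byte detection has not been inhibited:

```elisp
(list (detect-coding-string "abc" t)          ; ASCII only
      (detect-coding-string "a\0b" t)         ; contains a null byte
      ;; Without HIGHEST, the value is a list in priority order.
      (listp (detect-coding-string "caf\303\251")))
```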
Variable: *inhibit-nul-byte-detection*
If this variable has a non-|nil| value, null bytes are ignored when
detecting the encoding of a region or a string. This allows the
encoding of text that contains null bytes to be correctly detected,
such as Info files with Index nodes.
Variable: *inhibit-iso-escape-detection*
If this variable has a non-|nil| value, ISO-2022 escape sequences
are ignored when detecting the encoding of a region or a string. The
result is that no text is ever detected as encoded in some ISO-2022
encoding, and all escape sequences become visible in a buffer.
*Warning:* /Use this variable with extreme caution, because many
files in the Emacs distribution use ISO-2022 encoding./
Function: *coding-system-charset-list* /coding-system/
This function returns the list of character sets (see Character Sets
<#Character-Sets>) supported by coding-system. Some coding systems
that support too many character sets to list them all yield special
values:
* If coding-system supports all Emacs characters, the value is
|(emacs)|.
* If coding-system supports all Unicode characters, the value is
|(unicode)|.
* If coding-system supports all ISO-2022 charsets, the value is
|iso-2022|.
* If coding-system supports all the characters in the internal
coding system used by Emacs version 21 (prior to the
implementation of internal Unicode support), the value is
|emacs-mule|.
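For example, a sketch comparing a Unicode coding system with a single-charset one:

```elisp
(list (coding-system-charset-list 'utf-8)        ; special value (unicode)
      (coding-system-charset-list 'iso-latin-1)) ; an explicit charset list
```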
See Process Information <#Coding-systems-for-a-subprocess>, in
particular the description of the functions |process-coding-system| and
|set-process-coding-system|, for how to examine or set the coding
systems used for I/O to a subprocess.
33.10.4 User-Chosen Coding Systems
Function: *select-safe-coding-system* /from to &optional
default-coding-system accept-default-p file/
This function selects a coding system for encoding specified text,
asking the user to choose if necessary. Normally the specified text
is the text in the current buffer between from and to. If from is a
string, the string specifies the text to encode, and to is ignored.
If the specified text includes raw bytes (see Text Representations
<#Text-Representations>), |select-safe-coding-system| suggests
|raw-text| for its encoding.
If default-coding-system is non-|nil|, that is the first coding
system to try; if that can handle the text,
|select-safe-coding-system| returns that coding system. It can also
be a list of coding systems; then the function tries each of them
one by one. After trying all of them, it next tries the current
buffer’s value of |buffer-file-coding-system| (if it is not
|undecided|), then the default value of |buffer-file-coding-system|
and finally the user’s most preferred coding system, which the user
can set using the command |prefer-coding-system| (see Recognizing
Coding Systems
in The GNU Emacs Manual).
If one of those coding systems can safely encode all the specified
text, |select-safe-coding-system| chooses it and returns it.
Otherwise, it asks the user to choose from a list of coding systems
which can encode all the text, and returns the user’s choice.
default-coding-system can also be a list whose first element is |t|
and whose other elements are coding systems. Then, if no coding
system in the list can handle the text, |select-safe-coding-system|
queries the user immediately, without trying any of the three
alternatives described above. This is handy for checking only the
coding systems in the list.
The optional argument accept-default-p determines whether a coding
system selected without user interaction is acceptable. If it’s
omitted or |nil|, such a silent selection is always acceptable. If
it is non-|nil|, it should be a function;
|select-safe-coding-system| calls this function with one argument,
the base coding system of the selected coding system. If the
function returns |nil|, |select-safe-coding-system| rejects the
silently selected coding system, and asks the user to select a
coding system from a list of possible candidates.
If the variable |select-safe-coding-system-accept-default-p| is
non-|nil|, it should be a function taking a single argument. It is
used in place of accept-default-p, overriding any value supplied for
this argument.
As a final step, before returning the chosen coding system,
|select-safe-coding-system| checks whether that coding system is
consistent with what would be selected if the contents of the region
were read from a file. (If not, this could lead to data corruption
in a file subsequently re-visited and edited.) Normally,
|select-safe-coding-system| uses |buffer-file-name| as the file for
this purpose, but if file is non-|nil|, it uses that file instead
(this can be relevant for |write-region| and similar functions). If
it detects an apparent inconsistency, |select-safe-coding-system|
queries the user before selecting the coding system.
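A minimal sketch of the simplest case, assuming the current buffer is not visiting a file. The returned symbol may carry an end-of-line variant, so no exact value is shown:

```elisp
;; ASCII-only text can be encoded by utf-8, so the first candidate is
;; accepted and the user should not be asked anything.
(select-safe-coding-system "plain ASCII text" nil 'utf-8)
```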
Variable: *select-safe-coding-system-function*
This variable names the function to be called to request the user to
select a proper coding system for encoding text when the default
coding system for an output operation cannot safely encode that
text. The default value of this variable is
|select-safe-coding-system|. Emacs primitives that write text to
files, such as |write-region|, or send text to other processes, such
as |process-send-region|, normally call the value of this variable,
unless |coding-system-for-write| is bound to a non-|nil| value (see
Specifying Coding Systems <#Specifying-Coding-Systems>).
Here are two functions you can use to let the user specify a coding
system, with completion. See Completion <#Completion>.
Function: *read-coding-system* /prompt &optional default/
This function reads a coding system using the minibuffer, prompting
with string prompt, and returns the coding system name as a symbol.
If the user enters null input, default specifies which coding system
to return. It should be a symbol or a string.
Function: *read-non-nil-coding-system* /prompt/
This function reads a coding system using the minibuffer, prompting
with string prompt, and returns the coding system name as a symbol.
If the user tries to enter null input, it asks the user to try
again. See Coding Systems <#Coding-Systems>.
33.10.5 Default Coding Systems
This section describes variables that specify the default coding system
for certain files or when running certain subprograms, and the function
that I/O operations use to access them.
The idea of these variables is that you set them once and for all to the
defaults you want, and then do not change them again. To specify a
particular coding system for a particular operation in a Lisp program,
don’t change these variables; instead, override them using
|coding-system-for-read| and |coding-system-for-write| (see Specifying
Coding Systems <#Specifying-Coding-Systems>).
User Option: *auto-coding-regexp-alist*
This variable is an alist of text patterns and corresponding coding
systems. Each element has the form |(regexp . coding-system)|; a
file whose first few kilobytes match regexp is decoded with
coding-system when its contents are read into a buffer. The settings
in this alist take priority over |coding:| tags in the files and the
contents of |file-coding-system-alist| (see below). The default
value is set so that Emacs automatically recognizes mail files in
Babyl format and reads them with no code conversions.
User Option: *file-coding-system-alist*
This variable is an alist that specifies the coding systems to use
for reading and writing particular files. Each element has the form
|(pattern . coding)|, where pattern is a regular expression that
matches certain file names. The element applies to file names that
match pattern.
The CDR of the element, coding, should be either a coding system, a
cons cell containing two coding systems, or a function name (a
symbol with a function definition). If coding is a coding system,
that coding system is used for both reading the file and writing it.
If coding is a cons cell containing two coding systems, its CAR
specifies the coding system for decoding, and its CDR specifies the
coding system for encoding.
If coding is a function name, the function should take one argument,
a list of all arguments passed to |find-operation-coding-system|. It
must return a coding system or a cons cell containing two coding
systems. This value has the same meaning as described above.
If coding (or the value returned by the above function) is |undecided|,
the normal code-detection is performed.
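The element forms can be tried out with |find-operation-coding-system| (described later in this section). In this sketch the file extensions and paths are hypothetical, and the files need not exist:

```elisp
;; Bind the alist locally so the sketch does not disturb the session.
(let ((file-coding-system-alist
       '(("\\.xyz\\'" . utf-8)                     ; same system both ways
         ("\\.abc\\'" . (iso-latin-1 . utf-8)))))  ; (decoding . encoding)
  (list (find-operation-coding-system
         'insert-file-contents "/tmp/demo.xyz")
        (find-operation-coding-system
         'insert-file-contents "/tmp/demo.abc")
        (find-operation-coding-system
         'insert-file-contents "/tmp/demo.txt")))  ; no matching pattern
;; ⇒ ((utf-8 . utf-8) (iso-latin-1 . utf-8) nil)
```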
User Option: *auto-coding-alist*
This variable is an alist that specifies the coding systems to use
for reading and writing particular files. Its form is like that of
|file-coding-system-alist|, but, unlike the latter, this variable
takes priority over any |coding:| tags in the file.
Variable: *process-coding-system-alist*
This variable is an alist specifying which coding systems to use for
a subprocess, depending on which program is running in the
subprocess. It works like |file-coding-system-alist|, except that
pattern is matched against the program name used to start the
subprocess. The coding system or systems specified in this alist are
used to initialize the coding systems used for I/O to the
subprocess, but you can specify other coding systems later using
|set-process-coding-system|.
*Warning:* Coding systems such as |undecided|, which determine the
coding system from the data, do not work entirely reliably with
asynchronous subprocess output. This is because Emacs handles
asynchronous subprocess output in batches, as it arrives. If the coding
system leaves the character code conversion unspecified, or leaves the
end-of-line conversion unspecified, Emacs must try to detect the proper
conversion from one batch at a time, and this does not always work.
Therefore, with an asynchronous subprocess, if at all possible, use a
coding system which determines both the character code conversion and
the end of line conversion—that is, one like |latin-1-unix|, rather than
|undecided| or |latin-1|.
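Following that advice, a sketch of an entry that fixes both conversions for a program assumed here to be named cat; no process is actually started:

```elisp
(let ((process-coding-system-alist
       ;; Fully specified coding systems for decoding and encoding.
       '(("\\`cat\\'" . (utf-8-unix . utf-8-unix)))))
  (find-operation-coding-system 'call-process "cat"))
;; ⇒ (utf-8-unix . utf-8-unix)
```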
Variable: *network-coding-system-alist*
This variable is an alist that specifies the coding system to use
for network streams. It works much like |file-coding-system-alist|,
with the difference that the pattern in an element may be either a
port number or a regular expression. If it is a regular expression,
it is matched against the network service name used to open the
network stream.
Variable: *default-process-coding-system*
This variable specifies the coding systems to use for subprocess
(and network stream) input and output, when nothing else specifies
what to do.
The value should be a cons cell of the form |(input-coding .
output-coding)|. Here input-coding applies to input from the
subprocess, and output-coding applies to output to it.
User Option: *auto-coding-functions*
This variable holds a list of functions that try to determine a
coding system for a file based on its undecoded contents.
Each function in this list should be written to look at text in the
current buffer, but should not modify it in any way. The buffer will
contain the text of parts of the file. Each function should take one
argument, size, which tells it how many characters to look at,
starting from point. If the function succeeds in determining a
coding system for the file, it should return that coding system.
Otherwise, it should return |nil|.
The functions in this list may be called when the file is visited
and Emacs wants to decode its contents, when the file’s buffer is
about to be saved and Emacs wants to determine how to encode its
contents, or in both situations.
If a file has a ‘coding:’ tag, that takes precedence, so these
functions won’t be called.
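A sketch of a function that follows this calling convention. Both the function name |my-detect-demo-marker| and the marker text it searches for are hypothetical:

```elisp
(defun my-detect-demo-marker (size)
  "Return `utf-8' if a hypothetical marker occurs within SIZE chars of point.
Return nil otherwise, so that other detection rules can apply."
  (let ((limit (min (point-max) (+ (point) size))))
    ;; save-excursion keeps point where the caller left it.
    (save-excursion
      (when (search-forward "%%demo-encoding: utf-8" limit t)
        'utf-8))))

;; Try the function on a scratch buffer instead of installing it.
(with-temp-buffer
  (insert "%%demo-encoding: utf-8\nbody text\n")
  (goto-char (point-min))
  (my-detect-demo-marker (buffer-size)))
;; ⇒ utf-8
```

To put such a function into use, it would be added to the list, for example with |add-to-list|.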
Function: *find-auto-coding* /filename size/
This function tries to determine a suitable coding system for
filename. It examines the buffer visiting the named file, using the
variables documented above in sequence, until it finds a match for
one of the rules specified by these variables. It then returns a
cons cell of the form |(coding . source)|, where coding is the
coding system to use and source is a symbol, one of
|auto-coding-alist|, |auto-coding-regexp-alist|, |:coding|, or
|auto-coding-functions|, indicating which one supplied the matching
rule. The value |:coding| means the coding system was specified by
the |coding:| tag in the file (see coding tag
in The GNU Emacs Manual). The order of looking for a matching rule
is |auto-coding-alist| first, then |auto-coding-regexp-alist|, then
the |coding:| tag, and lastly |auto-coding-functions|. If no
matching rule was found, the function returns |nil|.
The second argument size is the size of text, in characters,
following point. The function examines text only within size
characters after point. Normally, the buffer should be positioned at
the beginning when this function is called, because one of the
places for the |coding:| tag is the first one or two lines of the
file; in that case, size should be the size of the buffer.
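A sketch of calling this function on a buffer that carries a |coding:| tag. The file name demo.txt is hypothetical and is assumed not to match any entry in |auto-coding-alist|. Going by the description above, the result should be a cons cell whose CDR is |:coding|:

```elisp
(with-temp-buffer
  ;; A coding: tag on the first line, as a file might carry it.
  (insert ";; -*- coding: latin-1 -*-\nsome text\n")
  ;; Position the buffer at the beginning and pass its full size.
  (goto-char (point-min))
  (find-auto-coding "demo.txt" (buffer-size)))
```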
Function: *set-auto-coding* /filename size/
This function returns a suitable coding system for file filename. It
uses |find-auto-coding| to find the coding system. If no coding
system could be determined, the function returns |nil|. The meaning
of the argument size is like in |find-auto-coding|.
Function: *find-operation-coding-system* /operation &rest arguments/
This function returns the coding system to use (by default) for
performing operation with arguments. The value has this form:
(decoding-system . encoding-system)
The first element, decoding-system, is the coding system to use for
decoding (in case operation does decoding), and encoding-system is
the coding system for encoding (in case operation does encoding).
The argument operation is a symbol; it should be one of
|write-region|, |start-process|, |call-process|,
|call-process-region|, |insert-file-contents|, or
|open-network-stream|. These are the names of the Emacs I/O
primitives that can do character code and eol conversion.
The remaining arguments should be the same arguments that might be
given to the corresponding I/O primitive. Depending on the
primitive, one of those arguments is selected as the /target/. For
example, if operation does file I/O, whichever argument specifies
the file name is the target. For subprocess primitives, the process
name is the target. For |open-network-stream|, the target is the
service name or port number.
Depending on operation, this function looks up the target in
|file-coding-system-alist|, |process-coding-system-alist|, or
|network-coding-system-alist|. If the target is found in the alist,
|find-operation-coding-system| returns its association in the alist;
otherwise it returns |nil|.
If operation is |insert-file-contents|, the argument corresponding
to the target may be a cons cell of the form |(filename . buffer)|.
In that case, filename is a file name to look up in
|file-coding-system-alist|, and buffer is a buffer that contains the
file’s contents (not yet decoded). If |file-coding-system-alist|
specifies a function to call for this file, and that function needs
to examine the file’s contents (as it usually does), it should
examine the contents of buffer instead of reading the file.
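The alist lookup can be sketched as follows. The helper
|demo-operation-coding| and the ‘.demo’ file extension are made up for
this illustration, and |file-coding-system-alist| is bound locally so
that the result does not depend on the user’s configuration.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun demo-operation-coding (file-name)
  "Look up FILE-NAME as the target of `insert-file-contents'."
  ;; Bind the alist to a single illustrative entry: files ending in
  ;; ".demo" are decoded as latin-1 and encoded as utf-8.
  (let ((file-coding-system-alist
         '(("\\.demo\\'" . (latin-1 . utf-8)))))
    (find-operation-coding-system 'insert-file-contents file-name)))

(demo-operation-coding "notes.demo")   ; the association (latin-1 . utf-8)
(demo-operation-coding "notes.txt")    ; nil, since nothing matches
```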
33.10.6 Specifying a Coding System for One Operation
You can specify the coding system for a specific operation by binding
the variables |coding-system-for-read| and/or |coding-system-for-write|.
Variable: *coding-system-for-read*
If this variable is non-|nil|, it specifies the coding system to use
for reading a file, or for input from a synchronous subprocess.
It also applies to any asynchronous subprocess or network stream,
but in a different way: the value of |coding-system-for-read| when
you start the subprocess or open the network stream specifies the
input decoding method for that subprocess or network stream. It
remains in use for that subprocess or network stream unless and
until overridden.
The right way to use this variable is to bind it with |let| for a
specific I/O operation. Its global value is normally |nil|, and you
should not globally set it to any other value. Here is an example of
the right way to use the variable:
;; Read the file with no character code conversion.
(let ((coding-system-for-read 'no-conversion))
(insert-file-contents filename))
When its value is non-|nil|, this variable takes precedence over all
other methods of specifying a coding system to use for input,
including |file-coding-system-alist|, |process-coding-system-alist|
and |network-coding-system-alist|.
Variable: *coding-system-for-write*
This works much like |coding-system-for-read|, except that it
applies to output rather than input. It affects writing to files, as
well as sending output to subprocesses and net connections. It also
applies to encoding command-line arguments with which Emacs invokes
subprocesses.
When a single operation does both input and output, as do
|call-process-region| and |start-process|, both
|coding-system-for-read| and |coding-system-for-write| affect it.
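By analogy with the |coding-system-for-read| example, here is a sketch
of binding |coding-system-for-write| around a single |write-region|
call and then reading the file back as raw bytes to see the effect.
The helper name |demo-bytes-written-as| and the temporary file prefix
are assumptions made for this illustration.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun demo-bytes-written-as (text coding-system)
  "Write TEXT to a temporary file using CODING-SYSTEM; return the bytes."
  (let ((file (make-temp-file "coding-demo")))
    (unwind-protect
        (progn
          ;; Bind the variable only around the one output operation.
          (let ((coding-system-for-write coding-system))
            (write-region text nil file nil 'silent))
          ;; Read the file back without any decoding.
          (with-temp-buffer
            (set-buffer-multibyte nil)
            (insert-file-contents-literally file)
            (buffer-string)))
      (delete-file file))))

;; "\u00fc" is the character u-umlaut; in Latin-1 it is the byte #xFC.
(demo-bytes-written-as "Gr\u00fcss" 'iso-8859-1)
```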
Variable: *coding-system-require-warning*
Binding |coding-system-for-write| to a non-|nil| value prevents
output primitives from calling the function specified by
|select-safe-coding-system-function| (see User-Chosen Coding Systems
<#User_002dChosen-Coding-Systems>). This is because C-x RET c
(|universal-coding-system-argument|) works by binding
|coding-system-for-write|, and Emacs should obey user selection. If
a Lisp program binds |coding-system-for-write| to a value that might
not be safe for encoding the text to be written, it can also bind
|coding-system-require-warning| to a non-|nil| value, which will
force the output primitives to check the encoding by calling the
value of |select-safe-coding-system-function| even though
|coding-system-for-write| is non-|nil|. Alternatively, call
|select-safe-coding-system| explicitly before using the specified
encoding.
User Option: *inhibit-eol-conversion*
When this variable is non-|nil|, no end-of-line conversion is done,
no matter which coding system is specified. This applies to all the
Emacs I/O and subprocess primitives, and to the explicit encoding
and decoding functions (see Explicit Encoding <#Explicit-Encoding>).
Sometimes, you need to prefer several coding systems for some operation,
rather than fix a single one. Emacs lets you specify a priority order
for using coding systems. This ordering affects the sorting of lists of
coding systems returned by functions such as
|find-coding-systems-region| (see Lisp and Coding Systems
<#Lisp-and-Coding-Systems>).
Function: *coding-system-priority-list* /&optional highestp/
This function returns the list of coding systems in the order of
their current priorities. Optional argument highestp, if non-|nil|,
means return only the highest priority coding system.
Function: *set-coding-system-priority* /&rest coding-systems/
This function puts coding-systems at the beginning of the priority
list for coding systems, thus making their priority higher than all
the rest.
Macro: *with-coding-priority* /coding-systems &rest body/
This macro executes body, like |progn| does (see progn
<#Sequencing>), with coding-systems at the front of the priority
list for coding systems. coding-systems should be a list of coding
systems to prefer during execution of body.
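A minimal sketch of these three facilities used together follows. The
helper name |demo-top-priority-with| is hypothetical; the example
assumes that |utf-8| is defined as a coding system, as it is in a
standard Emacs build.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun demo-top-priority-with (coding-systems)
  "Return the highest-priority coding system while CODING-SYSTEMS are preferred."
  ;; The macro temporarily moves CODING-SYSTEMS to the front of the
  ;; priority list and restores the previous order afterwards.
  (with-coding-priority coding-systems
    (coding-system-priority-list t)))

(demo-top-priority-with '(utf-8))
```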
33.10.7 Explicit Encoding and Decoding
All the operations that transfer text in and out of Emacs have the
ability to use a coding system to encode or decode the text. You can
also explicitly encode and decode text using the functions in this section.
The result of encoding, and the input to decoding, are not ordinary
text. They logically consist of a series of byte values; that is, a
series of ASCII and eight-bit characters. In unibyte buffers and
strings, these characters have codes in the range 0 through #xFF (255).
In a multibyte buffer or string, eight-bit characters have character
codes higher than #xFF (see Text Representations
<#Text-Representations>), but Emacs transparently converts them to their
single-byte values when you encode or decode such text.
The usual way to read a file into a buffer as a sequence of bytes, so
you can decode the contents explicitly, is with
|insert-file-contents-literally| (see Reading from Files
<#Reading-from-Files>); alternatively, specify a non-|nil| rawfile
argument when visiting a file with |find-file-noselect|. These methods
result in a unibyte buffer.
The usual way to use the byte sequence that results from explicitly
encoding text is to copy it to a file or process—for example, to write
it with |write-region| (see Writing to Files <#Writing-to-Files>), and
suppress encoding by binding |coding-system-for-write| to |no-conversion|.
Here are the functions to perform explicit encoding or decoding. The
encoding functions produce sequences of bytes; the decoding functions
are meant to operate on sequences of bytes. All of these functions
discard text properties. They also set |last-coding-system-used| to the
precise coding system they used.
Command: *encode-coding-region* /start end coding-system &optional
destination/
This command encodes the text from start to end according to coding
system coding-system. Normally, the encoded text replaces the
original text in the buffer, but the optional argument destination
can change that. If destination is a buffer, the encoded text is
inserted in that buffer after point (point does not move); if it is
|t|, the command returns the encoded text as a unibyte string
without inserting it.
If encoded text is inserted in some buffer, this command returns the
length of the encoded text.
The result of encoding is logically a sequence of bytes, but the
buffer remains multibyte if it was multibyte before, and any 8-bit
bytes are converted to their multibyte representation (see Text
Representations <#Text-Representations>).
Do /not/ use |undecided| for coding-system when encoding text, since
that may lead to unexpected results. Instead, use
|select-safe-coding-system| (see select-safe-coding-system
<#User_002dChosen-Coding-Systems>) to suggest a suitable encoding,
if there’s no obvious pertinent value for coding-system.
Function: *encode-coding-string* /string coding-system &optional nocopy
buffer/
This function encodes the text in string according to coding system
coding-system. It returns a new string containing the encoded text,
except when nocopy is non-|nil|, in which case the function may
return string itself if the encoding operation is trivial. The
result of encoding is a unibyte string.
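For instance, the following sketch encodes a short string as UTF-8
and inspects the result. The helper name |demo-encode-utf-8| is
hypothetical.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun demo-encode-utf-8 (text)
  "Return TEXT encoded as a unibyte string of UTF-8 bytes."
  (encode-coding-string text 'utf-8))

;; "\u00fc" (u-umlaut) occupies two bytes in UTF-8, so the five
;; characters below become six bytes.
(let ((bytes (demo-encode-utf-8 "Gr\u00fcss")))
  (list (multibyte-string-p bytes)   ; nil: the result is unibyte
        (length bytes)))             ; 6
```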
Command: *decode-coding-region* /start end coding-system &optional
destination/
This command decodes the text from start to end according to coding
system coding-system. To make explicit decoding useful, the text
before decoding ought to be a sequence of byte values, but both
multibyte and unibyte buffers are acceptable (in the multibyte case,
the raw byte values should be represented as eight-bit characters).
Normally, the decoded text replaces the original text in the buffer,
but the optional argument destination can change that. If
destination is a buffer, the decoded text is inserted in that buffer
after point (point does not move); if it is |t|, the command returns
the decoded text as a multibyte string without inserting it.
If decoded text is inserted in some buffer, this command returns the
length of the decoded text. If that buffer is a unibyte buffer (see
Selecting a Representation <#Selecting-a-Representation>), the
internal representation of the decoded text (see Text
Representations <#Text-Representations>) is inserted into the buffer
as individual bytes.
This command puts a |charset| text property on the decoded text. The
value of the property states the character set used to decode the
original text.
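The following sketch shows the destination argument set to |t|, so
that the buffer itself is left untouched and the decoded text comes
back as a string. The helper name |demo-decode-latin-1-bytes| is
hypothetical.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun demo-decode-latin-1-bytes (bytes)
  "Decode the unibyte string BYTES as Latin-1 and return the text."
  (with-temp-buffer
    ;; Use a unibyte buffer so that BYTES are stored as plain bytes.
    (set-buffer-multibyte nil)
    (insert bytes)
    ;; With DESTINATION set to t, nothing is replaced or inserted;
    ;; the decoded text is returned as a multibyte string.
    (decode-coding-region (point-min) (point-max) 'latin-1 t)))

;; \374 is the Latin-1 byte for u-umlaut.
(demo-decode-latin-1-bytes "Gr\374ss Gott")
```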
Function: *decode-coding-string* /string coding-system &optional nocopy
buffer/
This function decodes the text in string according to coding-system.
It returns a new string containing the decoded text, except when
nocopy is non-|nil|, in which case the function may return string
itself if the decoding operation is trivial. To make explicit
decoding useful, the contents of string ought to be a unibyte string
with a sequence of byte values, but a multibyte string is also
acceptable (assuming it contains 8-bit bytes in their multibyte form).
If optional argument buffer specifies a buffer, the decoded text is
inserted in that buffer after point (point does not move). In this
case, the return value is the length of the decoded text. If that
buffer is a unibyte buffer, the internal representation of the
decoded text is inserted into it as individual bytes.
This function puts a |charset| text property on the decoded text.
The value of the property states the character set used to decode
the original text:
(decode-coding-string "Gr\374ss Gott" 'latin-1)
⇒ #("Grüss Gott" 0 9 (charset iso-8859-1))
Function: *decode-coding-inserted-region* /from to filename &optional
visit beg end replace/
This function decodes the text from from to to as if it were being
read from file filename by |insert-file-contents| called with the
rest of the arguments provided.
The normal way to use this function is after reading text from a
file without decoding, if you decide you would rather have decoded
it. Instead of deleting the text and reading it again, this time
with decoding, you can call this function.
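That usage might look like the sketch below. The helper name
|demo-decode-after-literal-read| and the temporary file are
assumptions of this example, and |coding-system-for-read| is bound
explicitly so that the outcome does not depend on automatic detection
in the current locale.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun demo-decode-after-literal-read ()
  "Read a UTF-8 file as raw bytes, then decode the inserted text in place."
  (let ((file (make-temp-file "decode-demo")))
    (unwind-protect
        (progn
          ;; Create a small UTF-8 file to work with.
          (let ((coding-system-for-write 'utf-8))
            (write-region "Gr\u00fcss Gott" nil file nil 'silent))
          (with-temp-buffer
            ;; First read the file without any decoding...
            (insert-file-contents-literally file)
            ;; ...then decode what was inserted, instead of deleting
            ;; the text and reading the file a second time.
            (let ((coding-system-for-read 'utf-8))
              (decode-coding-inserted-region (point-min) (point-max) file))
            (buffer-string)))
      (delete-file file))))

(demo-decode-after-literal-read)
```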
33.10.8 Terminal I/O Encoding
Emacs can use coding systems to decode keyboard input and encode
terminal output. This is useful for terminals that transmit or display
text using a particular encoding, such as Latin-1. Emacs does not set
|last-coding-system-used| when encoding or decoding terminal I/O.
Function: *keyboard-coding-system* /&optional terminal/
This function returns the coding system used for decoding keyboard
input from terminal. A value of |no-conversion| means no decoding is
done. If terminal is omitted or |nil|, it means the selected frame’s
terminal. See Multiple Terminals <#Multiple-Terminals>.
Command: *set-keyboard-coding-system* /coding-system &optional terminal/
This command specifies coding-system as the coding system to use for
decoding keyboard input from terminal. If coding-system is |nil|,
that means not to decode keyboard input. If terminal is a frame, it
means that frame’s terminal; if it is |nil|, that means the
currently selected frame’s terminal. See Multiple Terminals
<#Multiple-Terminals>.
Function: *terminal-coding-system* /&optional terminal/
This function returns the coding system that is in use for encoding
terminal output from terminal. A value of |no-conversion| means no
encoding is done. If terminal is a frame, it means that frame’s
terminal; if it is |nil|, that means the currently selected frame’s
terminal.
Command: *set-terminal-coding-system* /coding-system &optional terminal/
This command specifies coding-system as the coding system to use for
encoding terminal output from terminal. If coding-system is |nil|,
that means not to encode terminal output. If terminal is a frame, it
means that frame’s terminal; if it is |nil|, that means the
currently selected frame’s terminal.
33.11 Input Methods
/Input methods/ provide convenient ways of entering non-ASCII characters
from the keyboard. Unlike coding systems, which translate non-ASCII
characters to and from encodings meant to be read by programs, input
methods provide human-friendly commands. (See Input Methods
in The GNU Emacs Manual, for information on how users use input methods
to enter text.) How to define input methods is not yet documented in
this manual, but here we describe how to use them.
Each input method has a name, which is currently a string; in the
future, symbols may also be usable as input method names.
Variable: *current-input-method*
This variable holds the name of the input method now active in the
current buffer. (It automatically becomes local in each buffer when
set in any fashion.) It is |nil| if no input method is active in the
buffer now.
User Option: *default-input-method*
This variable holds the default input method for commands that
choose an input method. Unlike |current-input-method|, this variable
is normally global.
Command: *set-input-method* /input-method/
This command activates input method input-method for the current
buffer. It also sets |default-input-method| to input-method. If
input-method is |nil|, this command deactivates any input method for
the current buffer.
Function: *read-input-method-name* /prompt &optional default inhibit-null/
This function reads an input method name with the minibuffer,
prompting with prompt. If default is non-|nil|, it is returned when
the user enters empty input. However, if inhibit-null is non-|nil|,
empty input signals an error.
The returned value is a string.
Variable: *input-method-alist*
This variable defines all the supported input methods. Each element
defines one input method, and should have the form:
(input-method language-env activate-func
title description args...)
Here input-method is the input method name, a string; language-env
is another string, the name of the language environment this input
method is recommended for. (That serves only for documentation
purposes.)
activate-func is a function to call to activate this method. The
args, if any, are passed as arguments to activate-func. All told,
the arguments to activate-func are input-method and the args.
title is a string to display in the mode line while this method is
active. description is a string describing this method and what it
is good for.
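Since each element is an ordinary list, its fields can be picked out
with |assoc| and |nth|. The helper below is a hypothetical sketch;
which input method names are actually present depends on the Emacs
installation.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun demo-input-method-summary (name)
  "Return (LANGUAGE-ENV TITLE) for input method NAME, or nil if unknown."
  (let ((entry (assoc name input-method-alist)))
    (when entry
      ;; Element layout: (input-method language-env activate-func
      ;;                  title description args...)
      (list (nth 1 entry) (nth 3 entry)))))

;; "latin-1-prefix" is only an example name; the call returns nil if
;; this installation does not define it.
(demo-input-method-summary "latin-1-prefix")
```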
The fundamental interface to input methods is through the variable
|input-method-function|. See Reading One Event <#Reading-One-Event>, and
Invoking the Input Method <#Invoking-the-Input-Method>.
33.12 Locales
In POSIX, locales control which language to use in language-related
features. These Emacs variables control how Emacs interacts with these
features.
Variable: *locale-coding-system*
This variable specifies the coding system to use for decoding system
error messages and (on the X Window System only) keyboard input, for
sending batch output to the standard output and error streams, for
encoding the format argument to |format-time-string|, and for
decoding the return value of |format-time-string|.
Variable: *system-messages-locale*
This variable specifies the locale to use for generating system
error messages. Changing the locale can cause messages to come out
in a different language or in a different orthography. If the
variable is |nil|, the locale is specified by environment variables
in the usual POSIX fashion.
Variable: *system-time-locale*
This variable specifies the locale to use for formatting time
values. Changing the locale can cause messages to appear according
to the conventions of a different language. If the variable is
|nil|, the locale is specified by environment variables in the usual
POSIX fashion.
Function: *locale-info* /item/
This function returns locale data item for the current POSIX locale,
if available. item should be one of these symbols:
|codeset|
Return the character set as a string (locale item |CODESET|).
|days|
Return a 7-element vector of day names (locale items |DAY_1|
through |DAY_7|).
|months|
Return a 12-element vector of month names (locale items |MON_1|
through |MON_12|).
|paper|
Return a list |(width height)| of 2 integers, for the default
paper size measured in millimeters (locale items
|_NL_PAPER_WIDTH| and |_NL_PAPER_HEIGHT|).
If the system can’t provide the requested information, or if item is
not one of those symbols, the value is |nil|. All strings in the
return value are decoded using |locale-coding-system|. See Locales
in The GNU Libc Manual, for more information about locales and
locale items.
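Because any item may be unavailable, callers should be prepared for a
|nil| value. The hypothetical helper below simply collects the four
items into a property list; the actual values depend on the system
and the current locale.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun demo-locale-summary ()
  "Return a plist of the locale items described above (any may be nil)."
  (list :codeset (locale-info 'codeset)
        :days    (locale-info 'days)
        :months  (locale-info 'months)
        :paper   (locale-info 'paper)))

(demo-locale-summary)
```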
34 Searching and Matching
GNU Emacs provides two ways to search through a buffer for specified
text: exact string searches and regular expression searches. After a
regular expression search, you can examine the /match data/ to determine
which text matched the whole regular expression or various portions of it.
• String Search <#String-Search> Search for an exact match.
• Searching and Case <#Searching-and-Case> Case-independent or
case-significant searching.
• Regular Expressions <#Regular-Expressions> Describing classes of
strings.
• Regexp Search <#Regexp-Search> Searching for a match for a regexp.
• POSIX Regexps <#POSIX-Regexps> Searching POSIX-style for the
longest match.
• Match Data <#Match-Data> Finding out which part of the text
matched, after a string or regexp search.
• Search and Replace <#Search-and-Replace> Commands that loop,
searching and replacing.
• Standard Regexps <#Standard-Regexps> Useful regexps for finding
sentences, pages,...
The ‘skip-chars…’ functions also perform a kind of searching. See
Skipping Characters <#Skipping-Characters>. To search for changes in
character properties, see Property Search <#Property-Search>.
34.1 Searching for Strings
These are the primitive functions for searching through the text in a
buffer. They are meant for use in programs, but you may call them
interactively. If you do so, they prompt for the search string; the
arguments limit and noerror are |nil|, and repeat is 1. For more details
on interactive searching, see Searching and Replacement
in The GNU Emacs Manual.
These search functions convert the search string to multibyte if the
buffer is multibyte; they convert the search string to unibyte if the
buffer is unibyte. See Text Representations <#Text-Representations>.
Command: *search-forward* /string &optional limit noerror count/
This function searches forward from point for an exact match for
string. If successful, it sets point to the end of the occurrence
found, and returns the new value of point. If no match is found, the
value and side effects depend on noerror (see below).
In the following example, point is initially at the beginning of the
line. Then |(search-forward "fox")| moves point after the last
letter of ‘fox’:
---------- Buffer: foo ----------
∗The quick brown fox jumped over the lazy dog.
---------- Buffer: foo ----------
(search-forward "fox")
⇒ 20
---------- Buffer: foo ----------
The quick brown fox∗ jumped over the lazy dog.
---------- Buffer: foo ----------
The argument limit specifies the bound to the search, and should be
a position in the current buffer. No match extending after that
position is accepted. If limit is omitted or |nil|, it defaults to
the end of the accessible portion of the buffer.
What happens when the search fails depends on the value of noerror.
If noerror is |nil|, a |search-failed| error is signaled. If noerror
is |t|, |search-forward| returns |nil| and does nothing. If noerror
is neither |nil| nor |t|, then |search-forward| moves point to the
upper bound and returns |nil|.
The argument noerror only affects valid searches which fail to find
a match. Invalid arguments cause errors regardless of noerror.
If count is a positive number n, the search is done n times; each
successive search starts at the end of the previous match. If all
these successive searches succeed, the function call succeeds,
moving point and returning its new value. Otherwise the function
call fails, with results depending on the value of noerror, as
described above. If count is a negative number -n, the search is
done n times in the opposite (backward) direction.
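The effect of the different noerror values can be seen in this
sketch, which runs three searches in a temporary buffer and records
point after each failure. The helper name |demo-search-noerror| is
hypothetical.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun demo-search-noerror ()
  "Show how NOERROR affects the return value and point of `search-forward'."
  (with-temp-buffer
    (insert "The quick brown fox jumped over the lazy dog.")
    (goto-char (point-min))
    (list (search-forward "fox" nil t)       ; success: 20
          (search-forward "cat" nil t)       ; failure: nil
          (point)                            ; point did not move: 20
          (search-forward "cat" nil 'move)   ; failure: nil
          (point))))                         ; point moved to the bound: 46

(demo-search-noerror)
;; ⇒ (20 nil 20 nil 46)
```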
Command: *search-backward* /string &optional limit noerror count/
This function searches backward from point for string. It is like
|search-forward|, except that it searches backwards rather than
forwards. Backward searches leave point at the beginning of the match.
Command: *word-search-forward* /string &optional limit noerror count/
This function searches forward from point for a word match for
string. If it finds a match, it sets point to the end of the match
found, and returns the new value of point.
Word matching regards string as a sequence of words, disregarding
punctuation that separates them. It searches the buffer for the same
sequence of words. Each word must be distinct in the buffer
(searching for the word ‘ball’ does not match the word ‘balls’), but
the details of punctuation and spacing are ignored (searching for
‘ball boy’ does match ‘ball. Boy!’).
In this example, point is initially at the beginning of the buffer;
the search leaves it between the ‘y’ and the ‘!’.
---------- Buffer: foo ----------
∗He said "Please! Find
the ball boy!"
---------- Buffer: foo ----------
(word-search-forward "Please find the ball, boy.")
⇒ 39
---------- Buffer: foo ----------
He said "Please! Find
the ball boy∗!"
---------- Buffer: foo ----------
If limit is non-|nil|, it must be a position in the current buffer;
it specifies the upper bound to the search. The match found must not
extend after that position.
If noerror is |nil|, then |word-search-forward| signals an error if
the search fails. If noerror is |t|, then it returns |nil| instead
of signaling an error. If noerror is neither |nil| nor |t|, it moves
point to limit (or the end of the accessible portion of the buffer)
and returns |nil|.
If count is a positive number, it specifies how many successive
occurrences to search for. Point is positioned at the end of the
last match. If count is a negative number, the search is backward
and point is positioned at the beginning of the last match.
Internally, |word-search-forward| and related functions use the
function |word-search-regexp| to convert string to a regular
expression that ignores punctuation.
Command: *word-search-forward-lax* /string &optional limit noerror count/
This command is identical to |word-search-forward|, except that the
beginning or the end of string need not match a word boundary,
unless string begins or ends in whitespace. For instance, searching
for ‘ball boy’ matches ‘ball boyee’, but does not match ‘balls boy’.
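The difference between the strict and lax variants can be explored
without a buffer search by matching the regexp that
|word-search-regexp| produces against sample strings. The helper name
|demo-word-match-p| is hypothetical, and passing a non-|nil| second
argument to |word-search-regexp| to request the lax form is how this
sketch assumes the lax commands obtain their regexp.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun demo-word-match-p (words text &optional lax)
  "Return t if the word sequence WORDS is found in TEXT.
A non-nil LAX requests the lax form of the regexp."
  (with-temp-buffer                 ; use the default syntax table
    (let ((case-fold-search t))
      (and (string-match-p (word-search-regexp words lax) text) t))))

(demo-word-match-p "ball boy" "ball. Boy!")     ; t: punctuation is ignored
(demo-word-match-p "ball" "balls")              ; nil: words must be whole
(demo-word-match-p "ball boy" "ball boyee")     ; nil: strict matching
(demo-word-match-p "ball boy" "ball boyee" t)   ; t: lax at the end
(demo-word-match-p "ball boy" "balls boy" t)    ; nil: inner word differs
```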
Command: *word-search-backward* /string &optional limit noerror count/
This function searches backward from point for a word match to
string. This function is just like |word-search-forward| except that
it searches backward and normally leaves point at the beginning of
the match.
Command: *word-search-backward-lax* /string &optional limit noerror count/
This command is identical to |word-search-backward|, except that the
beginning or the end of string need not match a word boundary,
unless string begins or ends in whitespace.
34.2 Searching and Case
By default, searches in Emacs ignore the case of the text they are
searching through; if you specify searching for ‘FOO’, then ‘Foo’ or
‘foo’ is also considered a match. This applies to regular expressions,
too; thus, ‘[aB]’ would match ‘a’ or ‘A’ or ‘b’ or ‘B’.
If you do not want this feature, set the variable |case-fold-search| to
|nil|. Then all letters must match exactly, including case. This is a
buffer-local variable; altering the variable affects only the current
buffer. (See Intro to Buffer-Local <#Intro-to-Buffer_002dLocal>.)
Alternatively, you may change the default value. In Lisp code, you will
more typically use |let| to bind |case-fold-search| to the desired value.
Note that the user-level incremental search feature handles case
distinctions differently. When the search string contains only lower
case letters, the search ignores case, but when the search string
contains one or more upper case letters, the search becomes
case-sensitive. But this has nothing to do with the searching functions
used in Lisp code. See Incremental Search
in The GNU Emacs Manual.
User Option: *case-fold-search*
This buffer-local variable determines whether searches should ignore
case. If the variable is |nil| they do not ignore case; otherwise
(and by default) they do ignore case.
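In Lisp code the usual pattern is to bind the variable with |let|
around the search, as in this sketch. The helper name
|demo-case-search| is hypothetical.

```elisp
;; Hypothetical helper, not part of Emacs.
(defun demo-case-search ()
  "Search for \"FOO\" in a buffer containing \"Foo\", with and without folding."
  (with-temp-buffer
    (insert "Foo")
    (list
     ;; Case is ignored: the search succeeds and returns point, 4.
     (let ((case-fold-search t))
       (goto-char (point-min))
       (search-forward "FOO" nil t))
     ;; Case is significant: the search fails and returns nil.
     (let ((case-fold-search nil))
       (goto-char (point-min))
       (search-forward "FOO" nil t)))))

(demo-case-search)
;; ⇒ (4 nil)
```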
User Option: *case-replace*
This variable determines whether the higher-level replacement
functions should preserve case. If the variable is |nil|, that means
to use the replacement text verbatim. A non-|nil| value means to
convert the case of the replacement text according to the text being
replaced.
This variable is used by passing it as an argument to the function
|replace-match|. See Replacing Match <#Replacing-Match>.
34.3 Regular Expressions
A /regular expression/, or /regexp/ for short, is a pattern that denotes
a (possibly infinite) set of strings. Searching for matches for a regexp
is a very powerful operation. This section explains how to write
regexps; the following section says how to search for them.
For interactive development of regular expressions, you can use the M-x
re-builder command. It provides a convenient interface for creating
regular expressions, by giving immediate visual feedback in a separate
buffer. As you edit the regexp, all its matches in the target buffer are
highlighted. Each parenthesized sub-expression of the regexp is shown in
a distinct face, which makes it easier to verify even very complex regexps.
• Syntax of Regexps <#Syntax-of-Regexps> Rules for writing regular
expressions.
• Regexp Example <#Regexp-Example> Illustrates regular expression
syntax.
• Rx Notation <#Rx-Notation> An alternative, structured regexp notation.
• Regexp Functions <#Regexp-Functions> Functions for operating on
regular expressions.
34.3.1 Syntax of Regular Expressions
Regular expressions have a syntax in which a few characters are special
constructs and the rest are /ordinary/. An ordinary character is a
simple regular expression that matches that character and nothing else.
The special characters are ‘.’, ‘*’, ‘+’, ‘?’, ‘[’, ‘^’, ‘$’, and ‘\’;
no new special characters will be defined in the future. The character
‘]’ is special if it ends a character alternative (see later). The
character ‘-’ is special inside a character alternative. A ‘[:’ and
balancing ‘:]’ enclose a character class inside a character alternative.
Any other character appearing in a regular expression is ordinary,
unless a ‘\’ precedes it.
For example, ‘f’ is not a special character, so it is ordinary, and
therefore ‘f’ is a regular expression that matches the string ‘f’ and no
other string. (It does /not/ match the string ‘fg’, but it does match a
/part/ of that string.) Likewise, ‘o’ is a regular expression that
matches only ‘o’.
Any two regular expressions a and b can be concatenated. The result is a
regular expression that matches a string if a matches some amount of the
beginning of that string and b matches the rest of the string.
As a simple example, we can concatenate the regular expressions ‘f’ and
‘o’ to get the regular expression ‘fo’, which matches only the string
‘fo’. Still trivial. To do something more powerful, you need to use one
of the special regular expression constructs.
• Regexp Special <#Regexp-Special> Special characters in regular
expressions.
• Char Classes <#Char-Classes> Character classes used in regular
expressions.
• Regexp Backslash <#Regexp-Backslash> Backslash-sequences in regular
expressions.
34.3.1.1 Special Characters in Regular Expressions
Here is a list of the characters that are special in a regular expression.
‘.’ (Period)
is a special character that matches any single character except a
newline. Using concatenation, we can make regular expressions like
‘a.b’, which matches any three-character string that begins with ‘a’
and ends with ‘b’.
‘*’
is not a construct by itself; it is a postfix operator that means to
match the preceding regular expression repetitively as many times as
possible. Thus, ‘o*’ matches any number of ‘o’s (including no ‘o’s).
‘*’ always applies to the /smallest/ possible preceding expression.
Thus, ‘fo*’ has a repeating ‘o’, not a repeating ‘fo’. It matches
‘f’, ‘fo’, ‘foo’, and so on.
The matcher processes a ‘*’ construct by matching, immediately, as
many repetitions as can be found. Then it continues with the rest of
the pattern. If that fails, backtracking occurs, discarding some of
the matches of the ‘*’-modified construct in the hope that this will
make it possible to match the rest of the pattern. For example, in
matching ‘ca*ar’ against the string ‘caaar’, the ‘a*’ first tries to
match all three ‘a’s; but the rest of the pattern is ‘ar’ and there
is only ‘r’ left to match, so this try fails. The next alternative
is for ‘a*’ to match only two ‘a’s. With this choice, the rest of
the regexp matches successfully.
*Warning:* Nested repetition operators can run for a very long time,
if they lead to ambiguous matching. For example, trying to match the
regular expression ‘\(x+y*\)*a’ against the string
‘xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxz’ could take hours before it
ultimately fails. Emacs must try each way of grouping the ‘x’s
before concluding that none of them can work. In general, avoid
expressions that can match the same string in multiple ways.
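As a small illustration of the greedy behavior and the backtracking step described above, the sketch below returns the text that a regexp matches. The helper name |my-demo-star-match| is an assumption made for this example and is not part of Emacs.

```elisp
;; Hypothetical helper: return the text REGEXP matches in STRING, or nil.
(defun my-demo-star-match (regexp string)
  "Return the first match of REGEXP in STRING as a string, or nil."
  (let ((case-fold-search nil))
    (and (string-match regexp string)
         (match-string 0 string))))

(my-demo-star-match "fo*" "fooo!")    ; ⇒ "fooo": `*' repeats only the `o'
(my-demo-star-match "fo*" "fab")      ; ⇒ "f": zero repetitions are allowed
(my-demo-star-match "ca*ar" "caaar")  ; ⇒ "caaar": `a*' gives back one `a'
```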
‘+’
is a postfix operator, similar to ‘*’ except that it must match the
preceding expression at least once. So, for example, ‘ca+r’ matches
the strings ‘car’ and ‘caaaar’ but not the string ‘cr’, whereas
‘ca*r’ matches all three strings.
‘?’
is a postfix operator, similar to ‘*’ except that it must match the
preceding expression either once or not at all. For example, ‘ca?r’
matches ‘car’ or ‘cr’; nothing else.
‘*?’, ‘+?’, ‘??’
These are /non-greedy/ variants of the operators ‘*’, ‘+’ and ‘?’.
Where those operators match the largest possible substring
(consistent with matching the entire containing expression), the
non-greedy variants match the smallest possible substring
(consistent with matching the entire containing expression).
For example, the regular expression ‘c[ad]*a’ when applied to the
string ‘cdaaada’ matches the whole string; but the regular
expression ‘c[ad]*?a’, applied to that same string, matches just
‘cda’. (The smallest possible match here for ‘[ad]*?’ that permits
the whole expression to match is ‘d’.)
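The difference can be checked directly. This sketch, using the hypothetical helper |my-demo-greedy-vs-lazy|, applies both regexps from the example to the same string and returns what each one matched.

```elisp
;; Hypothetical helper: compare `c[ad]*a' (greedy) with `c[ad]*?a'
;; (non-greedy) on the same STRING.
(defun my-demo-greedy-vs-lazy (string)
  "Return a list (GREEDY LAZY) of the text each regexp matches in STRING."
  (let ((case-fold-search nil))
    (list (and (string-match "c[ad]*a" string)
               (match-string 0 string))
          (and (string-match "c[ad]*?a" string)
               (match-string 0 string)))))

(my-demo-greedy-vs-lazy "cdaaada")  ; ⇒ ("cdaaada" "cda")
```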
‘[ … ]’
is a /character alternative/, which begins with ‘[’ and is
terminated by ‘]’. In the simplest case, the characters between the
two brackets are what this character alternative can match.
Thus, ‘[ad]’ matches either one ‘a’ or one ‘d’, and ‘[ad]*’ matches
any string composed of just ‘a’s and ‘d’s (including the empty
string). It follows that ‘c[ad]*r’ matches ‘cr’, ‘car’, ‘cdr’,
‘caddaar’, etc.
You can also include character ranges in a character alternative, by
writing the starting and ending characters with a ‘-’ between them.
Thus, ‘[a-z]’ matches any lower-case ASCII letter. Ranges may be
intermixed freely with individual characters, as in ‘[a-z$%.]’,
which matches any lower case ASCII letter or ‘$’, ‘%’ or period.
However, the ending character of one range should not be the
starting point of another one; for example, ‘[a-m-z]’ should be
avoided.
A character alternative can also specify named character classes
(see Char Classes <#Char-Classes>). This is a POSIX feature. For
example, ‘[[:ascii:]]’ matches any ASCII character. Using a
character class is equivalent to mentioning each of the characters
in that class; but the latter is not feasible in practice, since
some classes include thousands of different characters. A character
class should not appear as the lower or upper bound of a range.
The usual regexp special characters are not special inside a
character alternative. A completely different set of characters is
special: ‘]’, ‘-’ and ‘^’. To include ‘]’ in a character
alternative, put it at the beginning. To include ‘^’, put it
anywhere but at the beginning. To include ‘-’, put it at the end.
Thus, ‘[]^-]’ matches all three of these special characters. You
cannot use ‘\’ to escape these three characters, since ‘\’ is not
special here.
The following aspects of ranges are specific to Emacs, in that POSIX
allows but does not require this behavior and programs other than
Emacs may behave differently:
1. If |case-fold-search| is non-|nil|, ‘[a-z]’ also matches
upper-case letters.
2. A range is not affected by the locale’s collation sequence: it
always represents the set of characters with codepoints ranging
between those of its bounds, so that ‘[a-z]’ matches only ASCII
letters, even outside the C or POSIX locale.
3. If the lower bound of a range is greater than its upper bound,
the range is empty and represents no characters. Thus, ‘[z-a]’
always fails to match, and ‘[^z-a]’ matches any character,
including newline. However, a reversed range should always be
from the letter ‘z’ to the letter ‘a’ to make it clear that it
is not a typo; for example, ‘[+-*/]’ should be avoided, because
it matches only ‘/’ rather than the likely-intended four
characters.
Some kinds of character alternatives are not the best style even
though they have a well-defined meaning in Emacs. They include:
1. Although a range’s bound can be almost any character, it is
better style to stay within natural sequences of ASCII letters
and digits because most people have not memorized character code
tables. For example, ‘[.-9]’ is less clear than ‘[./0-9]’, and
‘[`-~]’ is less clear than ‘[`a-z{|}~]’. Unicode character
escapes can help here; for example, for most programmers
‘[ก-ฺ฿-๛]’ is less clear than ‘[\u0E01-\u0E3A\u0E3F-\u0E5B]’.
2. Although a character alternative can include duplicates, it is
better style to avoid them. For example, ‘[XYa-yYb-zX]’ is less
clear than ‘[XYa-z]’.
3. Although a range can denote just one, two, or three characters,
it is simpler to list the characters. For example, ‘[a-a0]’ is
less clear than ‘[a0]’, ‘[i-j]’ is less clear than ‘[ij]’, and
‘[i-k]’ is less clear than ‘[ijk]’.
4. Although a ‘-’ can appear at the beginning of a character
alternative or as the upper bound of a range, it is better style
to put ‘-’ by itself at the end of a character alternative. For
example, although ‘[-a-z]’ is valid, ‘[a-z-]’ is better style;
and although ‘[*--]’ is valid, ‘[*+,-]’ is clearer.
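A few of the rules for character alternatives can be tried out as follows. This is only a sketch; the function name |my-demo-char-alternatives| is hypothetical, and |case-fold-search| is bound explicitly so that the results do not depend on the user's settings.

```elisp
;; Hypothetical demo function collecting several results in a list.
(defun my-demo-char-alternatives ()
  "Return match positions illustrating special characters inside [...]."
  (list
   (let ((case-fold-search nil))
     (list
      ;; `]' first, `^' not first, `-' last: all three are literal.
      (string-match-p "[]^-]" "a]b")      ; ⇒ 1
      (string-match-p "[]^-]" "abc")      ; ⇒ nil
      ;; `+-*' is a reversed (empty) range, so only `/' can match.
      (string-match-p "[+-*/]" "1+2*3")   ; ⇒ nil
      (string-match-p "[+-*/]" "1/2")     ; ⇒ 1
      ;; With `-' placed last, all four operators are listed.
      (string-match-p "[+*/-]" "1+2*3")   ; ⇒ 1
      ;; Without case folding, `[a-z]' does not match `A'.
      (string-match-p "[a-z]" "A")))      ; ⇒ nil
   ;; With case folding, `[a-z]' also matches upper-case letters.
   (let ((case-fold-search t))
     (string-match-p "[a-z]" "A"))))      ; ⇒ 0

(my-demo-char-alternatives)  ; ⇒ ((1 nil nil 1 1 nil) 0)
```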
‘[^ … ]’
‘[^’ begins a /complemented character alternative/. This matches any
character except the ones specified. Thus, ‘[^a-z0-9A-Z]’ matches
all characters /except/ ASCII letters and digits.
‘^’ is not special in a character alternative unless it is the first
character. The character following the ‘^’ is treated as if it were
first (in other words, ‘-’ and ‘]’ are not special there).
A complemented character alternative can match a newline, unless
newline is mentioned as one of the characters not to match. This is
in contrast to the handling of regexps in programs such as |grep|.
You can specify named character classes, just like in character
alternatives. For instance, ‘[^[:ascii:]]’ matches any non-ASCII
character. See Char Classes <#Char-Classes>.
‘^’
When matching a buffer, ‘^’ matches the empty string, but only at
the beginning of a line in the text being matched (or the beginning
of the accessible portion of the buffer). Otherwise it fails to
match anything. Thus, ‘^foo’ matches a ‘foo’ that occurs at the
beginning of a line.
When matching a string instead of a buffer, ‘^’ matches at the
beginning of the string or after a newline character.
For historical compatibility reasons, ‘^’ can be used only at the
beginning of the regular expression, or after ‘\(’, ‘\(?:’ or ‘\|’.
‘$’
is similar to ‘^’ but matches only at the end of a line (or the end
of the accessible portion of the buffer). Thus, ‘x+$’ matches a
string of one ‘x’ or more at the end of a line.
When matching a string instead of a buffer, ‘$’ matches at the end
of the string or before a newline character.
For historical compatibility reasons, ‘$’ can be used only at the
end of the regular expression, or before ‘\)’ or ‘\|’.
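The sketch below illustrates how ‘^’ and ‘$’ behave when matching a string that contains a newline, and counts line-initial matches in a temporary buffer with |how-many|. The function name |my-demo-line-anchors| is an assumption made for the example.

```elisp
;; Hypothetical demo function for the `^' and `$' anchors.
(defun my-demo-line-anchors ()
  "Return a list of results for `^' and `$' on a two-line text."
  (let ((case-fold-search nil)
        (text "foo bar\nfoo baz"))
    (list
     (string-match-p "^foo" text)     ; ⇒ 0, beginning of the string
     (string-match-p "^foo" text 1)   ; ⇒ 8, just after the newline
     (string-match-p "ba.$" text)     ; ⇒ 4, `$' matches before the newline
     (string-match-p "^bar" text)     ; ⇒ nil, `bar' is not at a line start
     ;; In a buffer, `^' matches at the beginning of each line.
     (with-temp-buffer
       (insert text)
       (how-many "^foo" (point-min) (point-max))))))  ; ⇒ 2

(my-demo-line-anchors)  ; ⇒ (0 8 4 nil 2)
```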
‘\’
has two functions: it quotes the special characters (including ‘\’),
and it introduces additional special constructs.
Because ‘\’ quotes special characters, ‘\$’ is a regular expression
that matches only ‘$’, and ‘\[’ is a regular expression that matches
only ‘[’, and so on.
Note that ‘\’ also has special meaning in the read syntax of Lisp
strings (see String Type <#String-Type>), and must be quoted with
‘\’. For example, the regular expression that matches the ‘\’
character is ‘\\’. To write a Lisp string that contains the
characters ‘\\’, Lisp syntax requires you to quote each ‘\’ with
another ‘\’. Therefore, the read syntax for a regular expression
matching ‘\’ is |"\\\\"|.
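The two levels of quoting can be seen in the following sketch, which also uses the standard function |regexp-quote| to produce a regexp that matches a given string literally. The function name |my-demo-backslash-quoting| is hypothetical.

```elisp
;; Hypothetical demo function for backslash quoting.
(defun my-demo-backslash-quoting ()
  "Return results showing `\\' used to quote special characters."
  (let ((case-fold-search nil))
    (list
     (string-match-p "\\$" "cost: $5")   ; ⇒ 6, regexp `\$'
     (string-match-p "\\[" "a[0]")       ; ⇒ 1, regexp `\['
     ;; The Lisp string "a\\b" holds three characters: a, \ and b.
     (string-match-p "\\\\" "a\\b")      ; ⇒ 1, regexp `\\'
     ;; `regexp-quote' inserts the needed backslashes.
     (regexp-quote "a.b*c"))))           ; ⇒ "a\\.b\\*c"

(my-demo-backslash-quoting)  ; ⇒ (6 1 1 "a\\.b\\*c")
```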
*Please note:* For historical compatibility, special characters are
treated as ordinary ones if they are in contexts where their special
meanings make no sense. For example, ‘*foo’ treats ‘*’ as ordinary since
there is no preceding expression on which the ‘*’ can act. It is poor
practice to depend on this behavior; quote the special character anyway,
regardless of where it appears.
As a ‘\’ is not special inside a character alternative, it can never
remove the special meaning of ‘-’ or ‘]’. So you should not quote these
characters when they have no special meaning either. This would not
clarify anything, since backslashes can legitimately precede these
characters where they /have/ special meaning, as in ‘[^\]’ (|"[^\\]"|
for Lisp string syntax), which matches any single character except a
backslash.
In practice, most ‘]’ that occur in regular expressions close a
character alternative and hence are special. However, occasionally a
regular expression may try to match a complex pattern of literal ‘[’ and
‘]’. In such situations, it sometimes may be necessary to carefully
parse the regexp from the start to determine which square brackets
enclose a character alternative. For example, ‘[^][]]’ consists of the
complemented character alternative ‘[^][]’ (which matches any single
character that is not a square bracket), followed by a literal ‘]’.
The exact rules are that at the beginning of a regexp, ‘[’ is special
and ‘]’ not. This lasts until the first unquoted ‘[’, after which we are
in a character alternative; ‘[’ is no longer special (except when it
starts a character class) but ‘]’ is special, unless it immediately
follows the special ‘[’ or that ‘[’ followed by a ‘^’. This lasts until
the next special ‘]’ that does not end a character class. This ends the
character alternative and restores the ordinary syntax of regular
expressions; an unquoted ‘[’ is special again and a ‘]’ not.
34.3.1.2 Character Classes
Below is a table of the classes you can use in a character alternative,
and what they mean. Note that the ‘[’ and ‘]’ characters that enclose
the class name are part of the name, so a regular expression using these
classes needs one more pair of brackets. For example, a regular
expression matching a sequence of one or more letters and digits would
be ‘[[:alnum:]]+’, not ‘[:alnum:]+’.
‘[:ascii:]’
This matches any ASCII character (codes 0–127).
‘[:alnum:]’
This matches any letter or digit. For multibyte characters, it
matches characters whose Unicode ‘general-category’ property (see
Character Properties <#Character-Properties>) indicates they are
alphabetic or decimal number characters.
‘[:alpha:]’
This matches any letter. For multibyte characters, it matches
characters whose Unicode ‘general-category’ property (see Character
Properties <#Character-Properties>) indicates they are alphabetic
characters.
‘[:blank:]’
This matches horizontal whitespace, as defined by Annex C of the
Unicode Technical Standard #18. In particular, it matches spaces,
tabs, and other characters whose Unicode ‘general-category’ property
(see Character Properties <#Character-Properties>) indicates they
are spacing separators.
‘[:cntrl:]’
This matches any character whose code is in the range 0–31.
‘[:digit:]’
This matches ‘0’ through ‘9’. Thus, ‘[-+[:digit:]]’ matches any
digit, as well as ‘+’ and ‘-’.
‘[:graph:]’
This matches graphic characters—everything except whitespace, ASCII
and non-ASCII control characters, surrogates, and codepoints
unassigned by Unicode, as indicated by the Unicode
‘general-category’ property (see Character Properties
<#Character-Properties>).
‘[:lower:]’
This matches any lower-case letter, as determined by the current
case table (see Case Tables <#Case-Tables>). If |case-fold-search|
is non-|nil|, this also matches any upper-case letter.
‘[:multibyte:]’
This matches any multibyte character (see Text Representations
<#Text-Representations>).
‘[:nonascii:]’
This matches any non-ASCII character.
‘[:print:]’
This matches any printing character—either whitespace, or a graphic
character matched by ‘[:graph:]’.
‘[:punct:]’
This matches any punctuation character. (At present, for multibyte
characters, it matches anything that has non-word syntax.)
‘[:space:]’
This matches any character that has whitespace syntax (see Syntax
Class Table <#Syntax-Class-Table>).
‘[:unibyte:]’
This matches any unibyte character (see Text Representations
<#Text-Representations>).
‘[:upper:]’
This matches any upper-case letter, as determined by the current
case table (see Case Tables <#Case-Tables>). If |case-fold-search|
is non-|nil|, this also matches any lower-case letter.
‘[:word:]’
This matches any character that has word syntax (see Syntax Class
Table <#Syntax-Class-Table>).
‘[:xdigit:]’
This matches the hexadecimal digits: ‘0’ through ‘9’, ‘a’ through
‘f’ and ‘A’ through ‘F’.
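The following sketch exercises a few of these classes, including the extra pair of brackets and the effect of |case-fold-search| on ‘[:upper:]’. The function name |my-demo-char-classes| is hypothetical.

```elisp
;; Hypothetical demo function for named character classes.
(defun my-demo-char-classes ()
  "Return a list of results for several character-class regexps."
  (let ((case-fold-search nil)
        (s1 "  abc123!")
        (s2 "x = -42"))
    (list
     ;; The class needs its own brackets inside the alternative.
     (and (string-match "[[:alnum:]]+" s1)
          (match-string 0 s1))                ; ⇒ "abc123"
     ;; Without them, this is just the set of characters `:alnum'.
     (string-match-p "[:alnum:]+" "123")      ; ⇒ nil
     ;; Classes may be mixed with other characters.
     (and (string-match "[-+[:digit:]]+" s2)
          (match-string 0 s2))                ; ⇒ "-42"
     (string-match-p "[[:upper:]]" "abc")     ; ⇒ nil
     ;; With case folding, `[:upper:]' also matches lower-case letters.
     (let ((case-fold-search t))
       (string-match-p "[[:upper:]]" "abc")))))  ; ⇒ 0

(my-demo-char-classes)  ; ⇒ ("abc123" nil "-42" nil 0)
```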
34.3.1.3 Backslash Constructs in Regular Expressions
For the most part, ‘\’ followed by any character matches only that
character. However, there are several exceptions: certain sequences
starting with ‘\’ that have special meanings. Here is a table of the
special ‘\’ constructs.
‘\|’
specifies an alternative. Two regular expressions a and b with ‘\|’
in between form an expression that matches anything that either a or
b matches.
Thus, ‘foo\|bar’ matches either ‘foo’ or ‘bar’ but no other string.
‘\|’ applies to the largest possible surrounding expressions. Only a
surrounding ‘\( … \)’ grouping can limit the grouping power of ‘\|’.
If you need full backtracking capability to handle multiple uses of
‘\|’, use the POSIX regular expression functions (see POSIX Regexps
<#POSIX-Regexps>).
‘\{m\}’
is a postfix operator that repeats the previous pattern exactly m
times. Thus, ‘x\{5\}’ matches the string ‘xxxxx’ and nothing else.
‘c[ad]\{3\}r’ matches strings such as ‘caaar’, ‘cdddr’, ‘cadar’, and
so on.
‘\{m,n\}’
is a more general postfix operator that specifies repetition with a
minimum of m repeats and a maximum of n repeats. If m is omitted,
the minimum is 0; if n is omitted, there is no maximum. For both
forms, m and n, if specified, may be no larger than 2**16 - 1.
For example, ‘c[ad]\{1,2\}r’ matches the strings ‘car’, ‘cdr’,
‘caar’, ‘cadr’, ‘cdar’, and ‘cddr’, and nothing else.
‘\{0,1\}’ or ‘\{,1\}’ is equivalent to ‘?’.
‘\{0,\}’ or ‘\{,\}’ is equivalent to ‘*’.
‘\{1,\}’ is equivalent to ‘+’.
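As a sketch of the interval operator, the hypothetical predicate |my-demo-cxr-p| below tests whether a string is exactly one of the six strings listed in the example above; the ‘\`’ and ‘\'’ anchors force the regexp to match the whole string.

```elisp
;; Hypothetical predicate built on `c[ad]\{1,2\}r'.
(defun my-demo-cxr-p (string)
  "Return t if STRING is `c', one or two of `a' or `d', then `r'."
  (let ((case-fold-search nil))
    (and (string-match-p "\\`c[ad]\\{1,2\\}r\\'" string)
         t)))

(mapcar #'my-demo-cxr-p '("car" "cddr" "cr" "caddr"))
;; ⇒ (t t nil nil)
```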
‘\( … \)’
is a grouping construct that serves three purposes:
1. To enclose a set of ‘\|’ alternatives for other operations.
Thus, the regular expression ‘\(foo\|bar\)x’ matches either
‘foox’ or ‘barx’.
2. To enclose a complicated expression for the postfix operators
‘*’, ‘+’ and ‘?’ to operate on. Thus, ‘ba\(na\)*’ matches ‘ba’,
‘bana’, ‘banana’, ‘bananana’, etc., with any number (zero or
more) of ‘na’ strings.
3. To record a matched substring for future reference with ‘\digit’
(see below).
This last application is not a consequence of the idea of a
parenthetical grouping; it is a separate feature that was assigned
as a second meaning to the same ‘\( … \)’ construct because, in
practice, there was usually no conflict between the two meanings.
But occasionally there is a conflict, and that led to the
introduction of shy groups.
‘\(?: … \)’
is the /shy group/ construct. A shy group serves the first two
purposes of an ordinary group (controlling the nesting of other
operators), but it does not get a number, so you cannot refer back
to its value with ‘\digit’. Shy groups are particularly useful for
mechanically-constructed regular expressions, because they can be
added automatically without altering the numbering of ordinary,
non-shy groups.
Shy groups are also called /non-capturing/ or /unnumbered groups/.
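The effect on group numbering can be sketched as follows. In the first regexp the alternatives form group 1; in the second they form a shy group, so the following ‘\(x\)’ becomes group 1 instead. The helper name |my-demo-group-numbers| is hypothetical.

```elisp
;; Hypothetical helper comparing an ordinary group with a shy group.
(defun my-demo-group-numbers (string)
  "Return group 1 of STRING for a capturing and a shy first group."
  (let ((case-fold-search nil))
    (list
     (and (string-match "\\(foo\\|bar\\)\\(x\\)" string)
          (match-string 1 string))
     (and (string-match "\\(?:foo\\|bar\\)\\(x\\)" string)
          (match-string 1 string)))))

(my-demo-group-numbers "barx")  ; ⇒ ("bar" "x")
```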
‘\(?num: … \)’
is the /explicitly numbered group/ construct. Normal groups get
their number implicitly, based on their position, which can be
inconvenient. This construct allows you to force a particular group
number. There is no particular restriction on the numbering, e.g.,
you can have several groups with the same number in which case the
last one to match (i.e., the rightmost match) will win. Implicitly
numbered groups always get the smallest integer larger than the one
of any previous group.
‘\digit’
matches the same text that matched the digitth occurrence of a
grouping (‘\( … \)’) construct.
In other words, after the end of a group, the matcher remembers the
beginning and end of the text matched by that group. Later on in the
regular expression you can use ‘\’ followed by digit to match that
same text, whatever it may have been.
The strings matching the first nine grouping constructs appearing in
the entire regular expression passed to a search or matching
function are assigned numbers 1 through 9 in the order that the open
parentheses appear in the regular expression. So you can use ‘\1’
through ‘\9’ to refer to the text matched by the corresponding
grouping constructs.
For example, ‘\(.*\)\1’ matches any newline-free string that is
composed of two identical halves. The ‘\(.*\)’ matches the first
half, which may be anything, but the ‘\1’ that follows must match
the same exact text.
If a ‘\( … \)’ construct matches more than once (which can happen,
for instance, if it is followed by ‘*’), only the last match is
recorded.
If a particular grouping construct in the regular expression was
never matched—for instance, if it appears inside of an alternative
that wasn’t used, or inside of a repetition that repeated zero
times—then the corresponding ‘\digit’ construct never matches
anything. To use an artificial example, ‘\(foo\(b*\)\|lose\)\2’
cannot match ‘lose’: the second alternative inside the larger group
matches it, but then ‘\2’ is undefined and can’t match anything. But
it can match ‘foobb’, because the first alternative matches ‘foob’
and ‘\2’ matches ‘b’.
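The back-reference examples above can be checked with a short sketch. The names |my-demo-doubled-p| and |my-demo-backrefs| are hypothetical helpers introduced for the illustration.

```elisp
;; Hypothetical predicate: is STRING made of two identical halves?
(defun my-demo-doubled-p (string)
  "Return t if STRING consists of two identical newline-free halves."
  (let ((case-fold-search nil))
    (and (string-match-p "\\`\\(.*\\)\\1\\'" string)
         t)))

;; Hypothetical demo of unmatched and repeated groups.
(defun my-demo-backrefs ()
  "Return results for the `\\2' example and for a repeated group."
  (let ((case-fold-search nil)
        (regexp "\\(foo\\(b*\\)\\|lose\\)\\2"))
    (list
     (string-match-p regexp "lose")        ; ⇒ nil, `\2' never matched
     (and (string-match regexp "foobb")
          (match-string 2 "foobb"))        ; ⇒ "b"
     ;; A group matched several times records only its last match.
     (and (string-match "\\(a\\|b\\)*" "aab")
          (match-string 1 "aab")))))       ; ⇒ "b"

(my-demo-doubled-p "abcabc")  ; ⇒ t
(my-demo-doubled-p "abcab")   ; ⇒ nil
(my-demo-backrefs)            ; ⇒ (nil "b" "b")
```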
‘\w’
matches any word-constituent character. The editor syntax table
determines which characters these are. See Syntax Tables
<#Syntax-Tables>.
‘\W’
matches any character that is not a word constituent.
‘\scode’
matches any character whose syntax is code. Here code is a character
that represents a syntax code: thus, ‘w’ for word constituent, ‘-’
for whitespace, ‘(’ for open parenthesis, etc. To represent
whitespace syntax, use either ‘-’ or a space character. See Syntax
Class Table <#Syntax-Class-Table>, for a list of syntax codes and
the characters that stand for them.
‘\Scode’
matches any character whose syntax is not code.
‘\cc’
matches any character whose category is c. Here c is a character
that represents a category: thus, ‘c’ for Chinese characters or ‘g’
for Greek characters in the standard category table. You can see the
list of all the currently defined categories with M-x
describe-categories RET. You can also define your own categories in
addition to the standard ones using the |define-category| function
(see Categories <#Categories>).
‘\Cc’
matches any character whose category is not c.
The following regular expression constructs match the empty string—that
is, they don’t use up any characters—but whether they match depends on
the context. For all, the beginning and end of the accessible portion of
the buffer are treated as if they were the actual beginning and end of
the buffer.
‘\`’
matches the empty string, but only at the beginning of the buffer or
string being matched against.
‘\'’
matches the empty string, but only at the end of the buffer or
string being matched against.
‘\=’
matches the empty string, but only at point. (This construct is not
defined when matching against a string.)
‘\b’
matches the empty string, but only at the beginning or end of a
word. Thus, ‘\bfoo\b’ matches any occurrence of ‘foo’ as a separate
word. ‘\bballs?\b’ matches ‘ball’ or ‘balls’ as a separate word.
‘\b’ matches at the beginning or end of the buffer (or string)
regardless of what text appears next to it.
‘\B’
matches the empty string, but /not/ at the beginning or end of a
word, nor at the beginning or end of the buffer (or string).
‘\<’
matches the empty string, but only at the beginning of a word. ‘\<’
matches at the beginning of the buffer (or string) only if a
word-constituent character follows.
‘\>’
matches the empty string, but only at the end of a word. ‘\>’
matches at the end of the buffer (or string) only if the contents
end with a word-constituent character.
‘\_<’
matches the empty string, but only at the beginning of a symbol. A
symbol is a sequence of one or more word or symbol constituent
characters. ‘\_<’ matches at the beginning of the buffer (or string)
only if a symbol-constituent character follows.
‘\_>’
matches the empty string, but only at the end of a symbol. ‘\_>’
matches at the end of the buffer (or string) only if the contents
end with a symbol-constituent character.
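Word and symbol boundaries depend on the syntax table in effect. The sketch below therefore binds the standard syntax table, in which ‘-’ is assumed to be a symbol constituent but not a word constituent, so that ‘foo-bar’ is one symbol made of two words. The function name |my-demo-boundaries| is hypothetical.

```elisp
;; Hypothetical demo of `\b', `\_<' and `\_>'.
(defun my-demo-boundaries ()
  "Return match positions for word and symbol boundary constructs."
  (with-syntax-table (standard-syntax-table)
    (let ((case-fold-search nil))
      (list
       (string-match-p "\\bfoo\\b" "foo-bar")       ; ⇒ 0, `foo' is a word
       (string-match-p "\\_<foo\\_>" "foo-bar")     ; ⇒ nil, symbol is `foo-bar'
       (string-match-p "\\_<foo\\_>" "foo bar")     ; ⇒ 0
       (string-match-p "\\bballs?\\b" "football")   ; ⇒ nil, no boundary before `ball'
       (string-match-p "\\bballs?\\b" "two balls")))))  ; ⇒ 4

(my-demo-boundaries)  ; ⇒ (0 nil 0 nil 4)
```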
Not every string is a valid regular expression. For example, a string
that ends inside a character alternative without a terminating ‘]’ is
invalid, and so is a string that ends with a single ‘\’. If an invalid
regular expression is passed to any of the search functions, an
|invalid-regexp| error is signaled.
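One way to check a regexp before using it is to trap that error with |condition-case|, as in this sketch. The predicate name |my-demo-valid-regexp-p| is hypothetical.

```elisp
;; Hypothetical predicate: try the regexp and catch `invalid-regexp'.
(defun my-demo-valid-regexp-p (regexp)
  "Return t if REGEXP is a valid regular expression, else nil."
  (condition-case nil
      (progn (string-match-p regexp "") t)
    (invalid-regexp nil)))

(my-demo-valid-regexp-p "[a-z]")   ; ⇒ t
(my-demo-valid-regexp-p "[a-z")    ; ⇒ nil, no terminating `]'
(my-demo-valid-regexp-p "abc\\")   ; ⇒ nil, ends with a single `\'
```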
34.3.2 Complex Regexp Example
Here is a complicated regexp which was formerly used by Emacs to
recognize the end of a sentence together with any whitespace that
follows. (Nowadays Emacs uses a similar but more complex default regexp
constructed by the function |sentence-end|. See Standard Regexps
<#Standard-Regexps>.)
Below, we show first the regexp as a string in Lisp syntax (to
distinguish spaces from tab characters), and then the result of
evaluating it. The string constant begins and ends with a double-quote.
‘\"’ stands for a double-quote as part of the string, ‘\\’ for a
backslash as part of the string, ‘\t’ for a tab and ‘\n’ for a newline.
"[.?!][]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*"
⇒ "[.?!][]\"')}]*\\($\\| $\\|	\\|  \\)[ 	
]*"
In the output, tab and newline appear as themselves.
This regular expression contains four parts in succession and can be
deciphered as follows:
|[.?!]|
The first part of the pattern is a character alternative that
matches any one of three characters: period, question mark, and
exclamation mark. The match must begin with one of these three
characters. (This is one point where the new default regexp used by
Emacs differs from the old. The new value also allows some non-ASCII
characters that end a sentence without any following whitespace.)
|[]\"')}]*|
The second part of the pattern matches any closing braces and
quotation marks, zero or more of them, that may follow the period,
question mark or exclamation mark. The |\"| is Lisp syntax for a
double-quote in a string. The ‘*’ at the end indicates that the
immediately preceding regular expression (a character alternative,
in this case) may be repeated zero or more times.
|\\($\\| $\\|\t\\|  \\)|
The third part of the pattern matches the whitespace that follows
the end of a sentence: the end of a line (optionally with a space),
or a tab, or two spaces. The double backslashes mark the parentheses
and vertical bars as regular expression syntax; the parentheses
delimit a group and the vertical bars separate alternatives. The
dollar sign is used to match the end of a line.
|[ \t\n]*|
Finally, the last part of the pattern matches any additional
whitespace beyond the minimum needed to end a sentence.
In the |rx| notation (see Rx Notation <#Rx-Notation>), the regexp could
be written
(rx (any ".?!")                    ; Punctuation ending sentence.
    (zero-or-more (any "\"')]}"))  ; Closing quotes or brackets.
    (or line-end
        (seq " " line-end)
        "\t"
        "  ")                      ; Two spaces.
    (zero-or-more (any "\t\n ")))  ; Optional extra whitespace.
Since |rx| regexps are just S-expressions, they can be formatted and
commented as such.
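To see the string form of this regexp at work, the sketch below stores it in a variable and reports where the first sentence end is found. The last alternative is written with two literal spaces, as the description requires. The names |my-demo-old-sentence-end| and |my-demo-sentence-end-span| are hypothetical and are not the variables Emacs itself uses.

```elisp
;; The regexp discussed above, with two spaces in the last alternative.
(defvar my-demo-old-sentence-end
  "[.?!][]\"')}]*\\($\\| $\\|\t\\|  \\)[ \t\n]*"
  "Hypothetical variable holding the old sentence-end regexp.")

;; Hypothetical helper: locate the first sentence end in TEXT.
(defun my-demo-sentence-end-span (text)
  "Return (START END) of the first sentence end in TEXT, or nil."
  (let ((case-fold-search nil))
    (and (string-match my-demo-old-sentence-end text)
         (list (match-beginning 0) (match-end 0)))))

;; The period and the two spaces after it are matched.
(my-demo-sentence-end-span "Hello there.  How are you?")  ; ⇒ (11 14)
;; A period followed by a single space does not end a sentence.
(my-demo-sentence-end-span "Dr. Who?  Yes.")              ; ⇒ (7 10)
```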
34.3.3 The |rx| Structured Regexp Notation
As an alternative to the string-based syntax, Emacs provides the
structured |rx| notation based on Lisp S-expressions. This notation is
usually easier to read, write and maintain than regexp strings, and can
be indented and commented freely. It requires a conversion into string
form since that is what regexp functions expect, but that conversion
typically takes place during byte-compilation rather than when the Lisp
code using the regexp is run.
Here is an |rx| regexp^19 <#FOOT19> that matches a block comment in the
C programming language:
(rx "/*"                          ; Initial /*
    (zero-or-more
     (or (not (any "*"))          ; Either non-*,
         (seq "*"                 ; or * followed by
              (not (any "/")))))  ; non-/
    (one-or-more "*")             ; At least one star,
    "/")                          ; and the final /
or, using shorter synonyms and written more compactly,
(rx "/*"
    (* (| (not "*")
          (: "*" (not "/"))))
    (+ "*") "/")
In conventional string syntax, it would be written
"/\\*\\(?:[^*]\\|\\*[^/]\\)*\\*+/"
The |rx| notation is mainly useful in Lisp code; it cannot be used in
most interactive situations where a regexp is requested, such as when
running |query-replace-regexp| or in variable customization.
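In Lisp code, the value of an |rx| form is an ordinary regexp string and can be passed to any function that expects one. The sketch below uses the long form of the block-comment regexp shown above to extract the first comment from a piece of C text. The names |my-demo-c-comment-rx| and |my-demo-find-c-comment| are hypothetical.

```elisp
(require 'rx)

;; The block-comment regexp from above, stored as a string constant.
(defconst my-demo-c-comment-rx
  (rx "/*"
      (zero-or-more
       (or (not (any "*"))
           (seq "*" (not (any "/")))))
      (one-or-more "*")
      "/")
  "Hypothetical constant: regexp matching a C block comment.")

;; Hypothetical helper: return the first block comment in CODE.
(defun my-demo-find-c-comment (code)
  "Return the text of the first C block comment in CODE, or nil."
  (let ((case-fold-search nil))
    (and (string-match my-demo-c-comment-rx code)
         (match-string 0 code))))

(my-demo-find-c-comment "int x; /* a * b */ int y;")  ; ⇒ "/* a * b */"
(my-demo-find-c-comment "int x; // no block comment") ; ⇒ nil
```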
• Rx Constructs <#Rx-Constructs> Constructs valid in rx forms.
• Rx Functions <#Rx-Functions> Functions and macros that use rx forms.
• Extending Rx <#Extending-Rx> How to define your own rx forms.
34.3.3.1 Constructs in |rx| regexps
The various forms in |rx| regexps are described below. The shorthand rx
represents any |rx| form, and rx… means zero or more |rx| forms. Where
the corresponding string regexp syntax is given, A, B, … are string
regexp subexpressions.
Literals
|"some-string"|
Match the string ‘some-string’ literally. There are no characters
with special meaning, unlike in string regexps.
|?C|
Match the character ‘C’ literally.
Sequence and alternative
|(seq rx…)|
|(sequence rx…)|
|(: rx…)|
|(and rx…)|
Match the rxs in sequence. Without arguments, the expression matches
the empty string.
Corresponding string regexp: ‘AB…’ (subexpressions in sequence).
|(or rx…)|
|(| rx…)|
Match exactly one of the rxs. If all arguments are strings,
characters, or |or| forms so constrained, the longest possible match
will always be used. Otherwise, either the longest match or the
first (in left-to-right order) will be used. Without arguments, the
expression will not match anything at all.
Corresponding string regexp: ‘A\|B\|…’.
|unmatchable|
Refuse any match. Equivalent to |(or)|. See regexp-unmatchable
<#regexp_002dunmatchable>.
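The following sketch illustrates literals, sequences and alternatives. Note that a string inside |rx| is matched literally, so the period in |"a.b"| is not special. The function name |my-demo-rx-basics| is hypothetical.

```elisp
(require 'rx)

;; Hypothetical demo of literals, `seq' and `or' in rx.
(defun my-demo-rx-basics ()
  "Return match positions for a few simple rx forms."
  (let ((case-fold-search nil))
    (list
     (string-match-p (rx "a.b") "a.b")    ; ⇒ 0
     (string-match-p (rx "a.b") "axb")    ; ⇒ nil, `.' is literal here
     ;; Like the string regexp `\(?:foo\|bar\)x'.
     (string-match-p (rx (seq (or "foo" "bar") "x")) "a barx")  ; ⇒ 2
     ;; Several top-level forms are matched in sequence; `?x' is a
     ;; character literal.
     (string-match-p (rx (or "foo" "bar") ?x) "foo"))))         ; ⇒ nil

(my-demo-rx-basics)  ; ⇒ (0 nil 2 nil)
```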
Repetition
Normally, repetition forms are greedy, in that they attempt to match as
many times as possible. Some forms are non-greedy; they try to match as
few times as possible (see Non-greedy repetition
<#Non_002dgreedy-repetition>).
|(zero-or-more rx…)|
|(0+ rx…)|
Match the rxs zero or more times. Greedy by default.
Corresponding string regexp: ‘A*’ (greedy), ‘A*?’ (non-greedy)
|(one-or-more rx…)|
|(1+ rx…)|
Match the rxs one or more times. Greedy by default.
Corresponding string regexp: ‘A+’ (greedy), ‘A+?’ (non-greedy)
|(zero-or-one rx…)|
|(optional rx…)|
|(opt rx…)|
Match the rxs once or an empty string. Greedy by default.
Corresponding string regexp: ‘A?’ (greedy), ‘A??’ (non-greedy).
|(* rx…)|
Match the rxs zero or more times. Greedy.
Corresponding string regexp: ‘A*’
|(+ rx…)|
Match the rxs one or more times. Greedy.
Corresponding string regexp: ‘A+’
|(? rx…)|
Match the rxs once or an empty string. Greedy.
Corresponding string regexp: ‘A?’
|(*? rx…)|
Match the rxs zero or more times. Non-greedy.
Corresponding string regexp: ‘A*?’
|(+? rx…)|
Match the rxs one or more times. Non-greedy.
Corresponding string regexp: ‘A+?’
|(?? rx…)|
Match the rxs once or an empty string. Non-greedy.
Corresponding string regexp: ‘A??’
|(= n rx…)|
|(repeat n rx)|
Match the rxs exactly n times.
Corresponding string regexp: ‘A\{n\}’
|(>= n rx…)|
Match the rxs n or more times. Greedy.
Corresponding string regexp: ‘A\{n,\}’
|(** n m rx…)|
|(repeat n m rx…)|
Match the rxs at least n but no more than m times. Greedy.
Corresponding string regexp: ‘A\{n,m\}’
The greediness of some repetition forms can be controlled using the
following constructs. However, it is usually better to use the explicit
non-greedy forms above when such matching is required.
|(minimal-match rx)|
Match rx, with |zero-or-more|, |0+|, |one-or-more|, |1+|,
|zero-or-one|, |opt| and |optional| using non-greedy matching.
|(maximal-match rx)|
Match rx, with |zero-or-more|, |0+|, |one-or-more|, |1+|,
|zero-or-one|, |opt| and |optional| using greedy matching. This is
the default.
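The repetition forms can be compared in a short sketch. It uses the |rx| symbols |bos| and |eos| (beginning and end of string), which are standard |rx| forms even though they are not described in this part of the text. The function name |my-demo-rx-repetition| is hypothetical.

```elisp
(require 'rx)

;; Hypothetical demo of greedy, non-greedy and counted repetition.
(defun my-demo-rx-repetition ()
  "Return results for several rx repetition forms."
  (let ((case-fold-search nil)
        (s "cdaaada"))
    (list
     ;; Greedy, like the string regexp `c[ad]*a'.
     (and (string-match (rx "c" (* (any "ad")) "a") s)
          (match-string 0 s))                               ; ⇒ "cdaaada"
     ;; Explicit non-greedy form, like `c[ad]*?a'.
     (and (string-match (rx "c" (*? (any "ad")) "a") s)
          (match-string 0 s))                               ; ⇒ "cda"
     ;; `minimal-match' makes `zero-or-more' non-greedy.
     (and (string-match
           (rx (minimal-match (seq "c" (zero-or-more (any "ad")) "a")))
           s)
          (match-string 0 s))                               ; ⇒ "cda"
     ;; Counted repetition, like `c[ad]\{1,2\}r'.
     (string-match-p (rx bos "c" (** 1 2 (any "ad")) "r" eos) "cadr")    ; ⇒ 0
     (string-match-p (rx bos "c" (** 1 2 (any "ad")) "r" eos) "caddr"))))  ; ⇒ nil

(my-demo-rx-repetition)  ; ⇒ ("cdaaada" "cda" "cda" 0 nil)
```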
Matching single characters
|(any set…)|
|(char set…)|
|(in set…)|
Match a single character from one of the sets. Each set is a
character, a string representing the set of its characters, a range
or a character class (see below). A range is either a
hyphen-separated string like |"A-Z"|, or a cons of characters like
|(?A . ?Z)|.
Note that hyphen (|-|) is special in strings in this construct,
since it acts as a range separator. To include a hyphen, add it as a
separate character or single-character string.
Corresponding string regexp: ‘[…]’
|(not charspec)|
Match a character not included in charspec. charspec can be a
character, a single-character string, an |any|, |not|, |or|,
|intersection|, |syntax| or |category| form, or a character class.
If charspec is an |or| form, its arguments have the same
restrictions as those of |intersection|; see below.
Corresponding string regexp: ‘[^…]’, ‘\Scode’, ‘\Ccode’
|(intersection charset…)|
Match a character included in all of the charsets. Each charset can
be a character, a single-character string, an |any| form without
character classes, or an |intersection|, |or| or |not| form whose
arguments are also charsets.
|not-newline|, |nonl|
Match any character except a newline.
Corresponding string regexp: ‘.’ (dot)
|anychar|, |anything|
Match any character.
Corresponding string regexp: ‘.\|\n’ (for example)
character class
Match a character from a named character class:
|alpha|, |alphabetic|, |letter|
Match alphabetic characters. More precisely, match characters
whose Unicode ‘general-category’ property indicates that they
are alphabetic.
|alnum|, |alphanumeric|
Match alphabetic characters and digits. More precisely, match
characters whose Unicode ‘general-category’ property indicates
that they are alphabetic or decimal digits.
|digit|, |numeric|, |num|
Match the digits ‘0’–‘9’.
|xdigit|, |hex-digit|, |hex|
Match the hexadecimal digits ‘0’–‘9’, ‘A’–‘F’ and ‘a’–‘f’.
|cntrl|, |control|
Match any character whose code is in the range 0–31.
|blank|
Match horizontal whitespace. More precisely, match characters
whose Unicode ‘general-category’ property indicates that they
are spacing separators.
|space|, |whitespace|, |white|
Match any character that has whitespace syntax (see Syntax Class
Table <#Syntax-Class-Table>).
|lower|, |lower-case|
Match anything lower-case, as determined by the current case
table. If |case-fold-search| is non-nil, this also matches any
upper-case letter.
|upper|, |upper-case|
Match anything upper-case, as determined by the current case
table. If |case-fold-search| is non-nil, this also matches any
lower-case letter.
|graph|, |graphic|
Match any character except whitespace, ASCII and non-ASCII
control characters, surrogates, and codepoints unassigned by
Unicode, as indicated by the Unicode ‘general-category’ property.
|print|, |printing|
Match whitespace or a character matched by |graph|.
|punct|, |punctuation|
Match any punctuation character. (At present, for multibyte
characters, anything that has non-word syntax.)
|word|, |wordchar|
Match any character that has word syntax (see Syntax Class Table
<#Syntax-Class-Table>).
|ascii|
Match any ASCII character (codes 0–127).
|nonascii|
Match any non-ASCII character (but not raw bytes).
Corresponding string regexp: ‘[[:class:]]’
|(syntax syntax)|
Match a character with syntax syntax, being one of the following names:
Syntax name Syntax character
|whitespace| |-|
|punctuation| |.|
|word| |w|
|symbol| |_|
|open-parenthesis| |(|
|close-parenthesis| |)|
|expression-prefix| |'|
|string-quote| |"|
|paired-delimiter| |$|
|escape| |\|
|character-quote| |/|
|comment-start| |<|
|comment-end| |>|
|string-delimiter| |||
|comment-delimiter| |!|
For details, see Syntax Class Table <#Syntax-Class-Table>. Please
note that |(syntax punctuation)| is /not/ equivalent to the
character class |punctuation|.
Corresponding string regexp: ‘\scode’
|(category category)|
Match a character in category category, which is either one of the
names below or its category character.
Category name Category character
|space-for-indent| space
|base| |.|
|consonant| |0|
|base-vowel| |1|
|upper-diacritical-mark| |2|
|lower-diacritical-mark| |3|
|tone-mark| |4|
|symbol| |5|
|digit| |6|
|vowel-modifying-diacritical-mark| |7|
|vowel-sign| |8|
|semivowel-lower| |9|
|not-at-end-of-line| |<|
|not-at-beginning-of-line| |>|
|alpha-numeric-two-byte| |A|
|chinese-two-byte| |C|
|greek-two-byte| |G|
|japanese-hiragana-two-byte| |H|
|indian-two-byte| |I|
|japanese-katakana-two-byte| |K|
|strong-left-to-right| |L|
|korean-hangul-two-byte| |N|
|strong-right-to-left| |R|
|cyrillic-two-byte| |Y|
|combining-diacritic| |^|
|ascii| |a|
|arabic| |b|
|chinese| |c|
|ethiopic| |e|
|greek| |g|
|korean| |h|
|indian| |i|
|japanese| |j|
|japanese-katakana| |k|
|latin| |l|
|lao| |o|
|tibetan| |q|
|japanese-roman| |r|
|thai| |t|
|vietnamese| |v|
|hebrew| |w|
|cyrillic| |y|
|can-break| |||
For more information about currently defined categories, run the
command M-x describe-categories RET. For how to define new
categories, see Categories <#Categories>.
Corresponding string regexp: ‘\ccode’
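The character-matching forms above can be tried out with a small sketch. It assumes Emacs 27 or later (for |intersection|) and the standard syntax table; the function name |my-rx-charset-demo| is hypothetical and used only for illustration.

```elisp
(require 'rx)

(defun my-rx-charset-demo ()
  "Return the match positions of three character-matching rx forms.
Hypothetical helper, for illustration only."
  (let ((case-fold-search nil))
    (with-syntax-table (standard-syntax-table)
      (list
       ;; `not' around a set: first character that is not a-z.
       (string-match-p (rx (not (any "a-z"))) "abcXdef")
       ;; `intersection': a lower-case ASCII letter that is not a vowel.
       (string-match-p (rx (intersection (any "a-z")
                                         (not (any "aeiou"))))
                       "aeiouxyz")
       ;; `syntax': first character with whitespace syntax.
       (string-match-p (rx (syntax whitespace)) "foo bar")))))

;; (my-rx-charset-demo) ⇒ (3 5 3)
```

Binding |case-fold-search| to |nil| matters here: with case folding on, ‘X’ would be treated like ‘x’ and the negated set would not match it.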
Zero-width assertions
These all match the empty string, but only in specific places.
|line-start|, |bol|
Match at the beginning of a line.
Corresponding string regexp: ‘^’
|line-end|, |eol|
Match at the end of a line.
Corresponding string regexp: ‘$’
|string-start|, |bos|, |buffer-start|, |bot|
Match at the start of the string or buffer being matched against.
Corresponding string regexp: ‘\`’
|string-end|, |eos|, |buffer-end|, |eot|
Match at the end of the string or buffer being matched against.
Corresponding string regexp: ‘\'’
|point|
Match at point.
Corresponding string regexp: ‘\=’
|word-start|, |bow|
Match at the beginning of a word.
Corresponding string regexp: ‘\<’
|word-end|, |eow|
Match at the end of a word.
Corresponding string regexp: ‘\>’
|word-boundary|
Match at the beginning or end of a word.
Corresponding string regexp: ‘\b’
|not-word-boundary|
Match anywhere but at the beginning or end of a word.
Corresponding string regexp: ‘\B’
|symbol-start|
Match at the beginning of a symbol.
Corresponding string regexp: ‘\_<’
|symbol-end|
Match at the end of a symbol.
Corresponding string regexp: ‘\_>’
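As a rough illustration of how word and symbol boundaries differ, the sketch below matches ‘foo’ inside ‘foo-bar’. It assumes the standard syntax table, in which ‘-’ is a symbol constituent but not a word constituent; |my-rx-boundary-demo| is a hypothetical name.

```elisp
(require 'rx)

(defun my-rx-boundary-demo ()
  "Return (WORD-RESULT SYMBOL-RESULT) for matching \"foo\" in \"foo-bar\".
Hypothetical helper, for illustration only."
  (with-syntax-table (standard-syntax-table)
    (list
     ;; The word \"foo\" ends before the hyphen, so this matches at 0.
     (string-match-p (rx word-start "foo" word-end) "foo-bar")
     ;; The symbol is \"foo-bar\" as a whole, so this does not match.
     (string-match-p (rx symbol-start "foo" symbol-end) "foo-bar"))))

;; (my-rx-boundary-demo) ⇒ (0 nil)
```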
Capture groups
|(group rx…)|
|(submatch rx…)|
Match the rxs, making the matched text and position accessible in
the match data. The first group in a regexp is numbered 1;
subsequent groups will be numbered one higher than the previous group.
Corresponding string regexp: ‘\(…\)’
|(group-n n rx…)|
|(submatch-n n rx…)|
Like |group|, but explicitly assign the group number n. n must be
positive.
Corresponding string regexp: ‘\(?n:…\)’
|(backref n)|
Match the text previously matched by group number n. n must be in
the range 1–9.
Corresponding string regexp: ‘\n’
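A short sketch of |group| and |backref| working together: the predicate below accepts a string made of the same word twice, separated by one space. The name |my-doubled-word-p| is hypothetical.

```elisp
(require 'rx)

(defun my-doubled-word-p (string)
  "Return non-nil if STRING is one word, a space, and the same word again.
Hypothetical helper, for illustration only."
  (string-match-p (rx bos (group (+ alpha)) " " (backref 1) eos)
                  string))

;; (my-doubled-word-p "hey hey") ⇒ 0
;; (my-doubled-word-p "hey you") ⇒ nil
```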
Dynamic inclusion
|(literal expr)|
Match the literal string that is the result from evaluating the Lisp
expression expr. The evaluation takes place at call time, in the
current lexical environment.
|(regexp expr)|
|(regex expr)|
Match the string regexp that is the result from evaluating the Lisp
expression expr. The evaluation takes place at call time, in the
current lexical environment.
|(eval expr)|
Match the rx form that is the result from evaluating the Lisp
expression expr. The evaluation takes place at macro-expansion time
for |rx|, at call time for |rx-to-string|, in the current global
environment.
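The difference between |literal| and |regexp| can be sketched as follows, assuming Emacs 27 or later. Both helper names are hypothetical; the first treats its argument as plain text, the second as a string regexp.

```elisp
(require 'rx)

(defun my-exact-text-p (text string)
  "Return non-nil if STRING consists of exactly TEXT, taken literally.
Hypothetical helper, for illustration only."
  (string-match-p (rx bos (literal text) eos) string))

(defun my-bracketed-p (inner-regexp string)
  "Return non-nil if STRING is <...> with contents matching INNER-REGEXP.
Hypothetical helper, for illustration only."
  (string-match-p (rx bos "<" (regexp inner-regexp) ">" eos) string))

;; (my-exact-text-p "a.b" "a.b")  ⇒ 0
;; (my-exact-text-p "a.b" "axb")  ⇒ nil   ; the dot is not special
;; (my-bracketed-p "a.b" "<axb>") ⇒ 0     ; the dot is a regexp dot
```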
34.3.3.2 Functions and macros using |rx| regexps
Macro: *rx* /rx-expr…/
Translate the rx-exprs to a string regexp, as if they were the body
of a |(seq …)| form. The |rx| macro expands to a string constant,
or, if |literal| or |regexp| forms are used, a Lisp expression that
evaluates to a string.
Function: *rx-to-string* /rx-expr &optional no-group/
Translate rx-expr to a string regexp which is returned. If no-group
is absent or nil, bracket the result in a non-capturing group,
‘\(?:…\)’, if necessary to ensure that a postfix operator appended
to it will apply to the whole expression.
Arguments to |literal| and |regexp| forms in rx-expr must be string
literals.
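The effect of no-group is easiest to see when a postfix operator is appended to the result. The following is a sketch only; |my-rx-no-group-demo| is a hypothetical name, and the exact strings returned by |rx-to-string| are deliberately not shown, only their matching behavior.

```elisp
(require 'rx)

(defun my-rx-no-group-demo ()
  "Compare bracketed and unbracketed results of `rx-to-string'.
Hypothetical helper, for illustration only."
  (let ((grouped (rx-to-string '(seq "ab" "cd")))
        (bare    (rx-to-string '(seq "ab" "cd") t)))
    (list
     ;; With bracketing, `+' repeats the whole sequence.
     (string-match-p (concat "\\`" grouped "+\\'") "abcdabcd")
     ;; Without it, `+' applies only to the final character.
     (string-match-p (concat "\\`" bare "+\\'") "abcdabcd")
     (string-match-p (concat "\\`" bare "+\\'") "abcddd"))))

;; (my-rx-no-group-demo) ⇒ (0 nil 0)
```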
The |pcase| macro can use |rx| expressions as patterns directly; see rx
in pcase <#rx-in-pcase>.
For mechanisms to add user-defined extensions to the |rx| notation, see
Extending Rx <#Extending-Rx>.
34.3.3.3 Defining new |rx| forms
The |rx| notation can be extended by defining new symbols and
parameterized forms in terms of other |rx| expressions. This is handy
for sharing parts between several regexps, and for making complex ones
easier to build and understand by putting them together from smaller
pieces.
For example, you could define |name| to mean |(one-or-more letter)|, and
|(quoted x)| to mean |(seq ?' x ?')| for any x. These forms could then
be used in |rx| expressions like any other: |(rx (quoted name))| would
match a nonempty sequence of letters inside single quotes.
The Lisp macros below provide different ways of binding names to
definitions. Common to all of them are the following rules:
* Built-in |rx| forms, like |digit| and |group|, cannot be redefined.
* The definitions live in a name space of their own, separate from
that of Lisp variables. There is thus no need to attach a suffix
like |-regexp| to names; they cannot collide with anything else.
* Definitions cannot refer to themselves recursively, directly or
indirectly. If you find yourself needing this, you want a parser,
not a regular expression.
* Definitions are only ever expanded in calls to |rx| or
|rx-to-string|, not merely by their presence in definition macros.
This means that the order of definitions doesn’t matter, even when
they refer to each other, and that syntax errors only show up when
they are used, not when they are defined.
* User-defined forms are allowed wherever arbitrary |rx| expressions
are expected; for example, in the body of a |zero-or-one| form, but
not inside |any| or |category| forms. They are also allowed inside
|not| and |intersection| forms.
Macro: *rx-define* /name [arglist] rx-form/
Define name globally in all subsequent calls to |rx| and
|rx-to-string|. If arglist is absent, then name is defined as a
plain symbol to be replaced with rx-form. Example:
(rx-define haskell-comment (seq "--" (zero-or-more nonl)))
(rx haskell-comment)
⇒ "--.*"
If arglist is present, it must be a list of zero or more argument
names, and name is then defined as a parameterized form. When used
in an |rx| expression as |(name arg…)|, each arg will replace the
corresponding argument name inside rx-form.
arglist may end in |&rest| and one final argument name, denoting a
rest parameter. The rest parameter will expand to all extra actual
argument values not matched by any other parameter in arglist,
spliced into rx-form where it occurs. Example:
(rx-define moan (x y &rest r) (seq x (one-or-more y) r "!"))
(rx (moan "MOO" "A" "MEE" "OW"))
⇒ "MOOA+MEEOW!"
Since the definition is global, it is recommended to give name a
package prefix to avoid name clashes with definitions elsewhere, as
is usual when naming non-local variables and functions.
Macro: *rx-let* /(bindings…) body…/
Make the |rx| definitions in bindings available locally for |rx|
macro invocations in body, which is then evaluated.
Each element of bindings is of the form |(name [arglist] rx-form)|,
where the parts have the same meaning as in |rx-define| above. Example:
(rx-let ((comma-separated (item) (seq item (0+ "," item)))
         (number (1+ digit))
         (numbers (comma-separated number)))
  (re-search-forward (rx "(" numbers ")")))
The definitions are only available during the macro-expansion of
body, and are thus not present during execution of compiled code.
|rx-let| can be used not only inside a function, but also at top
level to include global variable and function definitions that need
to share a common set of |rx| forms. Since the names are local
inside body, there is no need for any package prefixes. Example:
(rx-let ((phone-number (seq (opt ?+) (1+ (any digit ?-)))))
  (defun find-next-phone-number ()
    (re-search-forward (rx phone-number)))
  (defun phone-number-p (string)
    (string-match-p (rx bos phone-number eos) string)))
The scope of the |rx-let| bindings is lexical, which means that they
are not visible outside body itself, even in functions called from
body.
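The |name| and |(quoted x)| forms used as an illustration at the start of this section could be written with |rx-let| roughly as follows. This is a sketch assuming Emacs 27 or later; |my-quoted-name-p| is a hypothetical function name.

```elisp
(require 'rx)

(defun my-quoted-name-p (string)
  "Return non-nil if STRING is a nonempty run of letters in single quotes.
Hypothetical helper, for illustration only."
  (rx-let ((name (one-or-more letter))
           (quoted (x) (seq ?' x ?')))
    (string-match-p (rx bos (quoted name) eos) string)))

;; (my-quoted-name-p "'abc'") ⇒ 0
;; (my-quoted-name-p "abc")   ⇒ nil
```

Because the bindings are local to the |rx-let| body, the short names |name| and |quoted| need no package prefix.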
Macro: *rx-let-eval* /bindings body…/
Evaluate bindings to a list of bindings as in |rx-let|, and evaluate
body with those bindings in effect for calls to |rx-to-string|.
This macro is similar to |rx-let|, except that the bindings argument
is evaluated (and thus needs to be quoted if it is a list literal),
and the definitions are substituted at run time, which is required
for |rx-to-string| to work. Example:
(rx-let-eval
    '((ponder (x) (seq "Where have all the " x " gone?")))
  (looking-at (rx-to-string
               '(ponder (or "flowers" "young girls"
                            "left socks")))))
Another difference from |rx-let| is that the bindings are
dynamically scoped, and thus also available in functions called from
body. However, they are not visible inside functions defined in body.
34.3.4 Regular Expression Functions
These functions operate on regular expressions.
Function: *regexp-quote* /string/
This function returns a regular expression whose only exact match is
string. Using this regular expression in |looking-at| will succeed
only if the next characters in the buffer are string; using it in a
search function will succeed if the text being searched contains
string. See Regexp Search <#Regexp-Search>.
This allows you to request an exact string match or search when
calling a function that wants a regular expression.
(regexp-quote "^The cat$")
⇒ "\\^The cat\\$"
One use of |regexp-quote| is to combine an exact string match with
context described as a regular expression. For example, this
searches for the string that is the value of string, surrounded by
whitespace:
(re-search-forward
(concat "\\s-" (regexp-quote string) "\\s-"))
The returned string may be string itself if it does not contain any
special characters.
Function: *regexp-opt* /strings &optional paren/
This function returns an efficient regular expression that will
match any of the strings in the list strings. This is useful when
you need to make matching or searching as fast as possible—for
example, for Font Lock mode^20 <#FOOT20>.
If strings is the empty list, the return value is a regexp that
never matches anything.
The optional argument paren can be any of the following:
a string
The resulting regexp is preceded by paren and followed by ‘\)’,
e.g. use ‘"\\(?1:"’ to produce an explicitly numbered group.
|words|
The resulting regexp is surrounded by ‘\<\(’ and ‘\)\>’.
|symbols|
The resulting regexp is surrounded by ‘\_<\(’ and ‘\)\_>’ (this
is often appropriate when matching programming-language keywords
and the like).
non-|nil|
The resulting regexp is surrounded by ‘\(’ and ‘\)’.
|nil|
The resulting regexp is surrounded by ‘\(?:’ and ‘\)’, if it is
necessary to ensure that a postfix operator appended to it will
apply to the whole expression.
The returned regexp is ordered in such a way that it will always
match the longest string possible.
Up to reordering, the resulting regexp of |regexp-opt| is equivalent
to but usually more efficient than that of a simplified version:
(defun simplified-regexp-opt (strings &optional paren)
  (let ((parens
         (cond
          ((stringp paren) (cons paren "\\)"))
          ((eq paren 'words) '("\\<\\(" . "\\)\\>"))
          ((eq paren 'symbols) '("\\_<\\(" . "\\)\\_>"))
          ((null paren) '("\\(?:" . "\\)"))
          (t '("\\(" . "\\)")))))
    (concat (car parens)
            (mapconcat 'regexp-quote strings "\\|")
            (cdr parens))))
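A brief usage sketch for |regexp-opt| follows. The keyword list and the name |my-regexp-opt-demo| are hypothetical; the exact regexp text that |regexp-opt| returns is not shown, since only its matching behavior is relied upon here.

```elisp
(defun my-regexp-opt-demo ()
  "Exercise `regexp-opt' with the `symbols' option and a longest match.
Hypothetical helper, for illustration only."
  (with-syntax-table (standard-syntax-table)
    (let ((keywords (regexp-opt '("if" "else" "elif") 'symbols))
          (longest  (regexp-opt '("foo" "foobar"))))
      (list
       ;; A keyword standing alone as a symbol is found.
       (string-match-p keywords "x else y")
       ;; The same letters inside a longer symbol are not.
       (string-match-p keywords "elsewhere")
       ;; The longest alternative wins: the match ends at index 6.
       (progn (string-match longest "foobar")
              (match-end 0))))))

;; (my-regexp-opt-demo) ⇒ (2 nil 6)
```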
Function: *regexp-opt-depth* /regexp/
This function returns the total number of grouping constructs
(parenthesized expressions) in regexp. This does not include shy
groups (see Regexp Backslash <#Regexp-Backslash>).
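For instance, in the following sketch the regexp has two capturing groups and one shy group, so the depth is 2. The name |my-depth-demo| is hypothetical.

```elisp
(defun my-depth-demo ()
  "Return the group count of a regexp with one shy group.
Hypothetical helper, for illustration only."
  ;; \\(a\\) and \\(c\\) are counted; \\(?:b\\) is shy and is not.
  (regexp-opt-depth "\\(a\\)\\(?:b\\)\\(c\\)"))

;; (my-depth-demo) ⇒ 2
```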
Function: *regexp-opt-charset* /chars/
This function returns a regular expression matching a character in
the list of characters chars.
(regexp-opt-charset '(?a ?b ?c ?d ?e))
⇒ "[a-e]"
Variable: *regexp-unmatchable*
This variable contains a regexp that is guaranteed not to match any
string at all. It is particularly useful as the default value for
variables that may be set to a pattern that actually matches something.
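A sketch of that use, with a hypothetical option |my-ignored-lines-regexp| and predicate |my-ignored-line-p|: callers can always pass the variable to a matching function without first checking for |nil|.

```elisp
(defvar my-ignored-lines-regexp regexp-unmatchable
  "Hypothetical option: lines matching this regexp are ignored.
The default matches nothing, so no line is ignored.")

(defun my-ignored-line-p (line)
  "Return non-nil if LINE matches `my-ignored-lines-regexp'.
Hypothetical helper, for illustration only."
  (string-match-p my-ignored-lines-regexp line))

;; (my-ignored-line-p "anything") ⇒ nil
;; (let ((my-ignored-lines-regexp "\\`#"))
;;   (my-ignored-line-p "# comment")) ⇒ 0
```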
34.4 Regular Expression Searching
In GNU Emacs, you can search for the next match for a regular expression
(see Syntax of Regexps <#Syntax-of-Regexps>) either incrementally or
not. For incremental search commands, see Regular Expression Search
in The GNU Emacs Manual. Here we describe only the search functions
useful in programs. The principal one is |re-search-forward|.
These search functions convert the regular expression to multibyte if
the buffer is multibyte; they convert the regular expression to unibyte
if the buffer is unibyte. See Text Representations <#Text-Representations>.
Command: *re-search-forward* /regexp &optional limit noerror count/
This function searches forward in the current buffer for a string of
text that is matched by the regular expression regexp. The function
skips over any amount of text that is not matched by regexp, and
leaves point at the end of the first match found. It returns the new
value of point.
If limit is non-|nil|, it must be a position in the current buffer.
It specifies the upper bound to the search. No match extending after
that position is accepted. If limit is omitted or |nil|, it defaults
to the end of the accessible portion of the buffer.
What |re-search-forward| does when the search fails depends on the
value of noerror:
|nil|
Signal a |search-failed| error.
|t|
Do nothing and return |nil|.
anything else
Move point to limit (or the end of the accessible portion of the
buffer) and return |nil|.
The argument noerror only affects valid searches which fail to find
a match. Invalid arguments cause errors regardless of noerror.
If count is a positive number n, the search is done n times; each
successive search starts at the end of the previous match. If all
these successive searches succeed, the function call succeeds,
moving point and returning its new value. Otherwise the function
call fails, with results depending on the value of noerror, as
described above. If count is a negative number -n, the search is
done n times in the opposite (backward) direction.
In the following example, point is initially before the ‘T’.
Evaluating the search call moves point to the end of that line
(between the ‘t’ of ‘hat’ and the newline).
---------- Buffer: foo ----------
I read "∗The cat in the hat
comes back" twice.
---------- Buffer: foo ----------
(re-search-forward "[a-z]+" nil t 5)
⇒ 27
---------- Buffer: foo ----------
I read "The cat in the hat∗
comes back" twice.
---------- Buffer: foo ----------
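In programs, |re-search-forward| is commonly called in a loop with noerror set to |t|, so that the loop ends quietly when no further match exists. The sketch below counts matches in a string by way of a temporary buffer; |my-count-matches-in-string| is a hypothetical name, and the sketch assumes regexp never matches the empty string (otherwise point would not advance and the loop would not terminate).

```elisp
(defun my-count-matches-in-string (regexp text)
  "Return the number of non-overlapping matches of REGEXP in TEXT.
Hypothetical helper; REGEXP must not match the empty string."
  (with-temp-buffer
    (insert text)
    (goto-char (point-min))
    (let ((count 0))
      ;; NOERROR is t: a failed search returns nil instead of signaling.
      (while (re-search-forward regexp nil t)
        (setq count (1+ count)))
      count)))

;; (my-count-matches-in-string "[0-9]+" "a1 b22 c333") ⇒ 3
```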
Command: *re-search-backward* /regexp &optional limit noerror count/
This function searches backward in the current buffer for a string
of text that is matched by the regular expression regexp, leaving
point at the beginning of the first text found.
This function is analogous to |re-search-forward|, but they are not
simple mirror images. |re-search-forward| finds the match whose
beginning is as close as possible to the starting point. If
|re-search-backward| were a perfect mirror image, it would find the
match whose end is as close as possible. However, in fact it finds
the match whose beginning is as close as possible (and yet ends
before the starting point). The reason for this is that matching a
regular expression at a given spot always works from beginning to
end, and starts at a specified beginning position.
A true mirror-image of |re-search-forward| would require a special
feature for matching regular expressions from end to beginning. It’s
not worth the trouble of implementing that.
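The asymmetry can be observed directly. In this sketch (the name |my-backward-demo| is hypothetical), a backward search for ‘a+’ from the end of ‘xaaa’ matches only the last ‘a’, because the nearest possible beginning is tried first; a forward search from the start would match all three.

```elisp
(defun my-backward-demo ()
  "Show that `re-search-backward' picks the nearest match beginning.
Hypothetical helper, for illustration only."
  (with-temp-buffer
    (insert "xaaa")
    (goto-char (point-max))
    (re-search-backward "a+")
    ;; Point is left at the beginning of the match.
    (list (point) (match-beginning 0) (match-end 0) (match-string 0))))

;; (my-backward-demo) ⇒ (4 4 5 "a")
```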
Function: *string-match* /regexp string &optional start/
This function returns the index of the start of the first match for
the regular expression regexp in string, or |nil| if there is no
match. If start is non-|nil|, the search starts at that index in
string.
For example,
(string-match
"quick" "The quick brown fox jumped quickly.")
⇒ 4
(string-match
"quick" "The quick brown fox jumped quickly." 8)
⇒ 27
The index of the first character of the string is 0, the index of
the second character is 1, and so on.
If this function finds a match, the index of the first character
beyond the match is available as |(match-end 0)|. See Match Data
<#Match-Data>.
(string-match
"quick" "The quick brown fox jumped quickly." 8)
⇒ 27
(match-end 0)
⇒ 32
Function: *string-match-p* /regexp string &optional start/
This predicate function does what |string-match| does, but it avoids
modifying the match data.
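A small sketch of the difference, using the hypothetical name |my-match-p-demo|: the match data set by |string-match| is still intact after an intervening call to |string-match-p|.

```elisp
(defun my-match-p-demo ()
  "Show that `string-match-p' leaves the match data alone.
Hypothetical helper, for illustration only."
  (string-match "b+" "aabbb")        ; match data now describes 2..5
  (string-match-p "z" "xyz")         ; succeeds, match data untouched
  (list (match-beginning 0) (match-end 0)))

;; (my-match-p-demo) ⇒ (2 5)
```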
Function: *looking-at* /regexp/
This function determines whether the text in the current buffer
directly following point matches the regular expression regexp.
“Directly following” means precisely that: the search is “anchored”
and it can succeed only starting with the first character following
point. The result is |t| if so, |nil| otherwise.
This function does not move point, but it does update the match
data. See Match Data <#Match-Data>. If you need to test for a match
without modifying the match data, use |looking-at-p|, described below.
In this example, point is located directly before the ‘T’. If it
were anywhere else, the result would be |nil|.
---------- Buffer: foo ----------
I read "∗The cat in the hat
comes back" twice.
---------- Buffer: foo ----------
(looking-at "The cat in the hat$")
⇒ t
Function: *looking-back* /regexp limit &optional greedy/
This function returns |t| if regexp matches the text immediately
before point (i.e., ending at point), and |nil| otherwise.
Because regular expression matching works only going forward, this
is implemented by searching backwards from point for a match that
ends at point. That can be quite slow if it has to search a long
distance. You can bound the time required by specifying a non-|nil|
value for limit, which says not to search before limit. In this
case, the match that is found must begin at or after limit. Here’s
an example:
---------- Buffer: foo ----------
I read "∗The cat in the hat
comes back" twice.
---------- Buffer: foo ----------
(looking-back "read \"" 3)
⇒ t
(looking-back "read \"" 4)
⇒ nil
If greedy is non-|nil|, this function extends the match backwards as
far as possible, stopping when a single additional previous
character cannot be part of a match for regexp. When the match is
extended, its starting position is allowed to occur before limit.
As a general recommendation, try to avoid using |looking-back|
wherever possible, since it is slow. For this reason, there are no
plans to add a |looking-back-p| function.
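When |looking-back| is used anyway, passing a tight limit keeps the cost bounded. The following sketch, with the hypothetical name |my-only-whitespace-before-point-p|, limits the backward search to the current line.

```elisp
(defun my-only-whitespace-before-point-p ()
  "Return t if only spaces or tabs lie between line start and point.
Hypothetical helper, for illustration only."
  ;; The limit keeps the backward search within the current line.
  (looking-back "^[ \t]*" (line-beginning-position)))

;; With the buffer text "   foo":
;;   point after the three spaces ⇒ t
;;   point at the end of the line ⇒ nil
```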
Function: *looking-at-p* /regexp/
This predicate function works like |looking-at|, but without
updating the match data.
Variable: *search-spaces-regexp*
If this variable is non-|nil|, it should be a regular expression
that says how to search for whitespace. In that case, any group of
spaces in a regular expression being searched for stands for use of
this regular expression. However, spaces inside of constructs such
as ‘[…]’ and ‘*’, ‘+’, ‘?’ are not affected by |search-spaces-regexp|.
Since this variable affects all regular expression search and match
constructs, you should bind it temporarily for as small as possible
a part of the code.
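A sketch of such a temporary binding follows. The helper name |my-search-loose-spaces| and the whitespace regexp chosen are assumptions for illustration; the binding is in effect only for the single search call.

```elisp
(defun my-search-loose-spaces (regexp)
  "Search forward for REGEXP, letting each run of spaces in REGEXP
match any run of spaces, tabs and newlines.
Hypothetical helper, for illustration only."
  (let ((search-spaces-regexp "[ \t\n]+"))
    (re-search-forward regexp nil t)))

;; In a buffer containing "foo", mixed whitespace, then "bar",
;; (my-search-loose-spaces "foo bar") finds the text even though a
;; plain (re-search-forward "foo bar" nil t) would not.
```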
34.5 POSIX Regular Expression Searching
The usual regular expression functions do backtracking when necessary to
handle the ‘\|’ and repetition constructs, but they continue this only
until they find /some/ match. Then they succeed and report the first
match found.
This section describes alternative search functions which perform the
full backtracking specified by the POSIX standard for regular expression
matching. They continue backtracking until they have tried all
possibilities and found all matches, so they can report the longest
match, as required by POSIX. This is much slower, so use these functions
only when you really need the longest match.
The POSIX search and match functions do not properly support the
non-greedy repetition operators (see non-greedy <#Regexp-Special>). This
is because POSIX backtracking conflicts with the semantics of non-greedy
repetition.
Command: *posix-search-forward* /regexp &optional limit noerror count/
This is like |re-search-forward| except that it performs the full
backtracking specified by the POSIX standard for regular expression
matching.
Command: *posix-search-backward* /regexp &optional limit noerror count/
This is like |re-search-backward| except that it performs the full
backtracking specified by the POSIX standard for regular expression
matching.
Function: *posix-looking-at* /regexp/
This is like |looking-at| except that it performs the full
backtracking specified by the POSIX standard for regular expression
matching.
Function: *posix-string-match* /regexp string &optional start/
This is like |string-match| except that it performs the full
backtracking specified by the POSIX standard for regular expression
matching.
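The difference between the usual and the POSIX functions shows up when an earlier alternative is a prefix of a later one. A minimal sketch, with the hypothetical name |my-posix-demo|:

```elisp
(defun my-posix-demo ()
  "Compare match ends of `string-match' and `posix-string-match'.
Hypothetical helper, for illustration only."
  (let ((regexp "foo\\|foobar")
        (string "foobar"))
    (list
     ;; The usual matcher stops at the first alternative that works.
     (progn (string-match regexp string) (match-end 0))
     ;; The POSIX matcher keeps backtracking to find the longest match.
     (progn (posix-string-match regexp string) (match-end 0)))))

;; (my-posix-demo) ⇒ (3 6)
```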
34.6 The Match Data
Emacs keeps track of the start and end positions of the segments of text
found during a search; this is called the /match data/. Thanks to the
match data, you can search for a complex pattern, such as a date in a
mail message, and then extract parts of the match under control of the
pattern.
Because the match data normally describe the most recent search only,
you must be careful not to do another search inadvertently between the
search you wish to refer back to and the use of the match data. If you
can’t avoid another intervening search, you must save and restore the
match data around it, to prevent it from being overwritten.
Notice that all functions are allowed to overwrite the match data unless
they’re explicitly documented not to do so. A consequence is that
functions that are run implicitly in the background (see Timers
<#Timers>, and Idle Timers <#Idle-Timers>) should likely save and
restore the match data explicitly.
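As a sketch of that advice, the hypothetical function |my-background-check| below wraps its own search in |save-match-data| (see Saving Match Data), so that a caller's match data survives the call. Both function names are assumptions for illustration.

```elisp
(defun my-background-check (string)
  "Search STRING without disturbing the caller's match data.
Hypothetical helper, standing in for code run from a timer."
  (save-match-data
    (string-match "x" string)))

(defun my-save-match-data-demo ()
  "Show that match data survives a call to `my-background-check'.
Hypothetical helper, for illustration only."
  (string-match "b+" "aabbb")        ; match data describes 2..5
  (my-background-check "xyz")        ; searches, then restores
  (list (match-beginning 0) (match-end 0)))

;; (my-save-match-data-demo) ⇒ (2 5)
```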
• Replacing Match <#Replacing-Match> Replacing a substring that was
matched.
• Simple Match Data <#Simple-Match-Data> Accessing single items of
match data, such as where a particular subexpression started.
• Entire Match Data <#Entire-Match-Data> Accessing the entire match
data at once, as a list.
• Saving Match Data <#Saving-Match-Data> Saving and restoring the
match data.
34.6.1 Replacing the Text that Matched
This function replaces all or part of the text matched by the last
search. It works by means of the match data.
Function: *replace-match* /replacement &optional fixedcase literal
string subexp/
This function performs a replacement operation on a buffer or string.
If you did the last search in a buffer, you should omit the string
argument or specify |nil| for it, and make sure that the current
buffer is the one in which you performed the last search. Then this
function edits the buffer, replacing the matched text with
replacement. It leaves point at the end of the replacement text.
If you performed the last search on a string, pass the same string
as string. Then this function returns a new string, in which the
matched text is replaced by replacement.
If fixedcase is non-|nil|, then |replace-match| uses the replacement
text without case conversion; otherwise, it converts the replacement
text depending upon the capitalization of the text to be replaced.
If the original text is all upper case, this converts the
replacement text to upper case. If all words of the original text
are capitalized, this capitalizes all the words of the replacement
text. If all the words are one-letter and they are all upper case,
they are treated as capitalized words rather than all-upper-case words.
If literal is non-|nil|, then replacement is inserted exactly as it
is, the only alterations being case changes as needed. If it is
|nil| (the default), then the character ‘\’ is treated specially. If
a ‘\’ appears in replacement, then it must be part of one of the
following sequences:
‘\&’
This stands for the entire text being replaced.
‘\n’, where n is a digit
This stands for the text that matched the nth subexpression in
the original regexp. Subexpressions are those expressions
grouped inside ‘\(…\)’. If the nth subexpression never matched,
an empty string is substituted.
‘\\’
This stands for a single ‘\’ in the replacement text.
‘\?’
This stands for itself (for compatibility with |replace-regexp|
and related commands; see Regexp Replace
in The GNU Emacs Manual).
Any other character following ‘\’ signals an error.
The substitutions performed by ‘\&’ and ‘\n’ occur after case
conversion, if any. Therefore, the strings they substitute are never
case-converted.
If subexp is non-|nil|, that says to replace just subexpression
number subexp of the regexp that was matched, not the entire match.
For example, after matching ‘foo \(ba*r\)’, calling |replace-match|
with 1 as subexp means to replace just the text that matched
‘\(ba*r\)’.
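The following sketch exercises |replace-match| on strings: back-references in replacement, the subexp argument, and automatic case conversion. All three function names are hypothetical.

```elisp
(defun my-swap-words (string)
  "Swap two space-separated lower-case words in STRING.
Hypothetical helper, for illustration only."
  (let ((case-fold-search nil))
    (if (string-match "\\`\\([a-z]+\\) \\([a-z]+\\)\\'" string)
        ;; FIXEDCASE is t; `\\2' and `\\1' refer to the groups.
        (replace-match "\\2 \\1" t nil string)
      string)))

(defun my-replace-subexp (string)
  "Replace only the text matched by group 1 of `foo \\(ba*r\\)'.
Hypothetical helper, for illustration only."
  (if (string-match "foo \\(ba*r\\)" string)
      ;; LITERAL is t and SUBEXP is 1: only the group is replaced.
      (replace-match "X" t t string 1)
    string))

(defun my-case-demo ()
  "Show case conversion when FIXEDCASE is nil.
Hypothetical helper, for illustration only."
  (let ((case-fold-search t)
        (string "FOO bar"))
    (string-match "foo" string)
    ;; The matched text is all upper case, so \"baz\" is upcased.
    (replace-match "baz" nil nil string)))

;; (my-swap-words "foo bar")           ⇒ "bar foo"
;; (my-replace-subexp "say foo baar!") ⇒ "say foo X!"
;; (my-case-demo)                      ⇒ "BAZ bar"
```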
Function: *match-substitute-replacement* /replacement &optional
fixedcase literal string subexp/
This function returns the text that would be inserted into the
buffer by |replace-match|, but without modifying the buffer. It is
useful if you want to present the user with the actual replacement
result, with constructs like ‘\n’ or ‘\&’ substituted with matched
groups. Arguments replacement and optional fixedcase, literal,
string and subexp have the same meaning as for |replace-match|.
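A possible use, sketched with the hypothetical helper |my-preview-replacement|: compute what would replace the match in a string, without building the full result.

```elisp
(defun my-preview-replacement (regexp replacement string)
  "Return the text that would replace the match of REGEXP in STRING.
Return nil if REGEXP does not match.
Hypothetical helper, for illustration only."
  (when (string-match regexp string)
    (match-substitute-replacement replacement t nil string)))

;; (my-preview-replacement "\\(world\\)" "<\\1>" "hello world")
;;   ⇒ "<world>"
```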
34.6.2 Simple Match Data Access
This section explains how to use the match data to find out what was
matched by the last search or match operation, if it succeeded.
You can ask about the entire matching text, or about a particular
parenthetical subexpression of a regular expression. The count argument
in the functions below specifies which. If count is zero, you are asking
about the entire match. If count is positive, it specifies which
subexpression you want.
Recall that the subexpressions of a regular expression are those
expressions grouped with escaped parentheses, ‘\(…\)’. The countth
subexpression is found by counting occurrences of ‘\(’ from the
beginning of the whole regular expression. The first subexpression is
numbered 1, the second 2, and so on. Only regular expressions can have
subexpressions—after a simple string search, the only information
available is about the entire match.
Every successful search sets the match data. Therefore, you should query
the match data immediately after searching, before calling any other
function that might perform another search. Alternatively, you may save
and restore the match data (see Saving Match Data <#Saving-Match-Data>)
around the call to functions that could perform another search. Or use
the functions that explicitly do not modify the match data; e.g.,
|string-match-p|.
A search which fails may or may not alter the match data. In the current
implementation, it does not, but we may change it in the future. Don’t
try to rely on the value of the match data after a failing search.
Function: *match-string* /count &optional in-string/
This function returns, as a string, the text matched in the last
search or match operation. It returns the entire text if count is
zero, or just the portion corresponding to the countth parenthetical
subexpression, if count is positive.
If the last such operation was done against a string with
|string-match|, then you should pass the same string as the argument
in-string. After a buffer search or match, you should omit in-string
or pass |nil| for it; but you should make sure that the current
buffer when you call |match-string| is the one in which you did the
searching or matching. Failure to follow this advice will lead to
incorrect results.
The value is |nil| if count is out of range, or for a subexpression
inside a ‘\|’ alternative that wasn’t used or a repetition that
repeated zero times.
Function: *match-string-no-properties* /count &optional in-string/
This function is like |match-string| except that the result has no
text properties.
Function: *match-beginning* /count/
If the last regular expression search found a match, this function
returns the position of the start of the matching text or of a
subexpression of it.
If count is zero, then the value is the position of the start of the
entire match. Otherwise, count specifies a subexpression in the
regular expression, and the value of the function is the starting
position of the match for that subexpression.
The value is |nil| for a subexpression inside a ‘\|’ alternative
that wasn’t used or a repetition that repeated zero times.
Function: *match-end* /count/
This function is like |match-beginning| except that it returns the
position of the end of the match, rather than the position of the
beginning.
Here is an example of using the match data, with a comment showing the
positions within the text:
(string-match "\\(qu\\)\\(ick\\)"
              "The quick fox jumped quickly.")
              ;0123456789
⇒ 4
(match-string 0 "The quick fox jumped quickly.")
⇒ "quick"
(match-string 1 "The quick fox jumped quickly.")
⇒ "qu"
(match-string 2 "The quick fox jumped quickly.")
⇒ "ick"
(match-beginning 1) ; The beginning of the match
⇒ 4 ; with ‘qu’ is at index 4.
(match-beginning 2) ; The beginning of the match
⇒ 6 ; with ‘ick’ is at index 6.
(match-end 1) ; The end of the match
⇒ 6 ; with ‘qu’ is at index 6.
(match-end 2) ; The end of the match
⇒ 9 ; with ‘ick’ is at index 9.
Here is another example. Point is initially located at the beginning of
the line. The search moves point to just after the space, immediately
before the word ‘in’.
The beginning of the entire match is at the 9th character of the buffer
(‘T’), and the beginning of the match for the first subexpression is at
the 13th character (‘c’).
(list
(re-search-forward "The \\(cat \\)")
(match-beginning 0)
(match-beginning 1))
⇒ (17 9 13)
---------- Buffer: foo ----------
I read "The cat ∗in the hat comes back" twice.
        ^   ^
        9  13
---------- Buffer: foo ----------
(In this case, the index returned is a buffer position; the first
character of the buffer counts as 1.)
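The same search can be reproduced in a temporary buffer. This sketch also calls |match-string| without the in-string argument, as advised for buffer searches; the use of |with-temp-buffer| is just a convenience for the illustration.

```elisp
(with-temp-buffer
  (insert "I read \"The cat in the hat comes back\" twice.")
  (goto-char (point-min))
  ;; The third argument t makes the search return nil instead of
  ;; signaling an error if nothing matches.
  (when (re-search-forward "The \\(cat \\)" nil t)
    (list (point)
          (match-string 0)
          (match-string 1)
          (match-beginning 1))))
;; ⇒ (17 "The cat " "cat " 13)
```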
34.6.3 Accessing the Entire Match Data
The functions |match-data| and |set-match-data| read or write the entire
match data, all at once.
Function: *match-data* /&optional integers reuse reseat/
This function returns a list of positions (markers or integers) that
record all the information on the text that the last search matched.
Element zero is the position of the beginning of the match for the
whole expression; element one is the position of the end of the
match for the expression. The next two elements are the positions of
the beginning and end of the match for the first subexpression, and
so on. In general, element number 2n corresponds to
|(match-beginning n)|; and element number 2n + 1 corresponds to
|(match-end n)|.
Normally all the elements are markers or |nil|, but if integers is
non-|nil|, that means to use integers instead of markers. (In that
case, the buffer itself is appended as an additional element at the
end of the list, to facilitate complete restoration of the match
data.) If the last match was done on a string with |string-match|,
then integers are always used, since markers can’t point into a string.
If reuse is non-|nil|, it should be a list. In that case,
|match-data| stores the match data in reuse. That is, reuse is
destructively modified. reuse does not need to have the right
length. If it is not long enough to contain the match data, it is
extended. If it is too long, the length of reuse stays the same, but
the elements that were not used are set to |nil|. The purpose of
this feature is to reduce the need for garbage collection.
If reseat is non-|nil|, all markers on the reuse list are reseated
to point to nowhere.
As always, there must be no possibility of intervening searches
between the call to a search function and the call to |match-data|
that is intended to access the match data for that search.
(match-data)
     ⇒ (#<marker at 9 in foo>
          #<marker at 17 in foo>
          #<marker at 13 in foo>
          #<marker at 17 in foo>)
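After a match against a string, the list holds integers, which makes the 2n and 2n + 1 layout easy to inspect. The sample string below is an assumption for this sketch.

```elisp
(string-match "\\(qu\\)\\(ick\\)" "The quick fox")
(match-data)
;; ⇒ (4 9 4 6 6 9)

;; Element 2n is (match-beginning n); element 2n + 1 is (match-end n).
(list (nth 4 (match-data)) (match-beginning 2)
      (nth 5 (match-data)) (match-end 2))
;; ⇒ (6 6 9 9)

;; After a buffer search, a non-nil INTEGERS argument appends the buffer.
(with-temp-buffer
  (insert "The quick fox")
  (goto-char (point-min))
  (re-search-forward "quick")
  (bufferp (car (last (match-data t)))))
;; ⇒ t
```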
Function: *set-match-data* /match-list &optional reseat/
This function sets the match data from the elements of match-list,
which should be a list that was the value of a previous call to
|match-data|. (More precisely, anything that has the same format
will work.)
If match-list refers to a buffer that doesn’t exist, you don’t get
an error; that sets the match data in a meaningless but harmless way.
If reseat is non-|nil|, all markers on the match-list list are
reseated to point to nowhere.
|store-match-data| is a semi-obsolete alias for |set-match-data|.
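A minimal sketch of restoring match data by hand follows; the strings are arbitrary.

```elisp
(string-match "\\(qu\\)\\(ick\\)" "The quick fox")
(let ((saved (match-data)))
  ;; This second search overwrites the match data...
  (string-match "x" "fox")
  ;; ...and this puts the earlier data back.
  (set-match-data saved)
  (match-end 2))
;; ⇒ 9
```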
34.6.4 Saving and Restoring the Match Data
When you call a function that may search, you may need to save and
restore the match data around that call, if you want to preserve the
match data from an earlier search for later use. Here is an example that
shows the problem that arises if you fail to save the match data:
(re-search-forward "The \\(cat \\)")
⇒ 48
(foo) ; |foo| does more searching.
(match-end 0)
⇒ 61 ; Unexpected result—not 48!
You can save and restore the match data with |save-match-data|:
Macro: *save-match-data* /body…/
This macro executes body, saving and restoring the match data around
it. The return value is the value of the last form in body.
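For instance, a helper that searches internally can protect its callers by wrapping its body in |save-match-data|. The function |my-count-vowels| below is a hypothetical name invented for this sketch.

```elisp
;; `my-count-vowels' is a hypothetical helper, not part of Emacs.
(defun my-count-vowels (string)
  "Return the number of lower-case vowels in STRING."
  (save-match-data
    (let ((count 0)
          (start 0))
      (while (string-match "[aeiou]" string start)
        (setq count (1+ count)
              start (match-end 0)))
      count)))

(string-match "\\(qu\\)\\(ick\\)" "The quick fox")
(my-count-vowels "hello world")   ; ⇒ 3
(match-beginning 2)               ; ⇒ 6, still from the first search
```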
You could use |set-match-data| together with |match-data| to imitate the
effect of the macro |save-match-data|. Here is how:
(let ((data (match-data)))
  (unwind-protect
      … ; Ok to change the original match data.
    (set-match-data data)))
Emacs automatically saves and restores the match data when it runs
process filter functions (see Filter Functions <#Filter-Functions>) and
process sentinels (see Sentinels <#Sentinels>).
34.7 Search and Replace
If you want to find all matches for a regexp in part of the buffer, and
replace them, the best way is to write an explicit loop using
|re-search-forward| and |replace-match|, like this:
(while (re-search-forward "foo[ \t]+bar" nil t)
  (replace-match "foobar"))
See Replacing the Text that Matched <#Replacing-Match>, for a
description of |replace-match|.
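To try the loop without touching a real buffer, it can be run in a temporary one. The sample text is an assumption for this sketch.

```elisp
(with-temp-buffer
  (insert "foo  bar and foo\tbar")
  ;; The search starts at point, so move to the beginning first.
  (goto-char (point-min))
  (while (re-search-forward "foo[ \t]+bar" nil t)
    (replace-match "foobar"))
  (buffer-string))
;; ⇒ "foobar and foobar"
```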
However, replacing matches in a string is more complex, especially if
you want to do it efficiently. So Emacs provides a function to do this.
Function: *replace-regexp-in-string* /regexp rep string &optional
fixedcase literal subexp start/
This function copies string and searches it for matches for regexp,
and replaces them with rep. It returns the modified copy. If start
is non-|nil|, the search for matches starts at that index in string,
and the returned value does not include the first start characters
of string. To get the whole transformed string, concatenate the
first start characters of string with the return value.
This function uses |replace-match| to do the replacement, and it
passes the optional arguments fixedcase, literal and subexp along to
|replace-match|.
Instead of a string, rep can be a function. In that case,
|replace-regexp-in-string| calls rep for each match, passing the
text of the match as its sole argument. It collects the value rep
returns and passes that to |replace-match| as the replacement
string. The match data at this point are the result of matching
regexp against a substring of string.
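A few illustrative calls follow, with made-up input strings: a plain string replacement, a function as rep, and the effect of start on the returned value.

```elisp
;; Collapse runs of whitespace to a single space.
(replace-regexp-in-string "[ \t]+" " " "foo   bar\tbaz")
;; ⇒ "foo bar baz"

;; REP as a function: it receives the matched text.
(replace-regexp-in-string
 "[0-9]+"
 (lambda (m) (number-to-string (* 2 (string-to-number m))))
 "3 apples, 10 pears")
;; ⇒ "6 apples, 20 pears"

;; With START 4, the first four characters are not in the result.
(replace-regexp-in-string "o" "0" "foo boo" nil nil nil 4)
;; ⇒ "b00"
```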
If you want to write a command along the lines of |query-replace|, you
can use |perform-replace| to do the work.
Function: *perform-replace* /from-string replacements query-flag
regexp-flag delimited-flag &optional repeat-count map start end backward
region-noncontiguous-p/
This function is the guts of |query-replace| and related commands.
It searches for occurrences of from-string in the text between
positions start and end and replaces some or all of them. If start
is |nil| (or omitted), point is used instead, and the end of the
buffer’s accessible portion is used for end. (If the optional
argument backward is non-|nil|, the search starts at end and goes
backward.)
If query-flag is |nil|, it replaces all occurrences; otherwise, it
asks the user what to do about each one.
If regexp-flag is non-|nil|, then from-string is considered a
regular expression; otherwise, it must match literally. If
delimited-flag is non-|nil|, then only matches surrounded by
word boundaries are considered.
The argument replacements specifies what to replace occurrences
with. If it is a string, that string is used. It can also be a list
of strings, to be used in cyclic order.
If replacements is a cons cell, |(function . data)|, this means to
call function after each match to get the replacement text. This
function is called with two arguments: data, and the number of
replacements already made.
If repeat-count is non-|nil|, it should be an integer. Then it
specifies how many times to use each of the strings in the
replacements list before advancing cyclically to the next one.
If from-string contains upper-case letters, then |perform-replace|
binds |case-fold-search| to |nil|, and it uses the replacements
without altering their case.
Normally, the keymap |query-replace-map| defines the possible user
responses for queries. The argument map, if non-|nil|, specifies a
keymap to use instead of |query-replace-map|.
Non-|nil| region-noncontiguous-p means that the region between start
and end is composed of noncontiguous pieces. The most common example
of this is a rectangular region, where the pieces are separated by
newline characters.
This function uses one of two functions to search for the next
occurrence of from-string. These functions are specified by the
values of two variables: |replace-re-search-function| and
|replace-search-function|. The former is called when the argument
regexp-flag is non-|nil|, the latter when it is |nil|.
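As a rough sketch of a non-interactive call, the following replaces every literal ‘foo’ using a list of replacement strings in cyclic order. The buffer text and the replacement strings are assumptions for the illustration; query-flag is |nil|, so no questions are asked.

```elisp
(with-temp-buffer
  (insert "foo one foo two foo")
  ;; START is omitted, so replacement begins at point.
  (goto-char (point-min))
  ;; Arguments: from-string, replacements, query-flag, regexp-flag,
  ;; delimited-flag.
  (perform-replace "foo" '("1" "2") nil nil nil)
  (buffer-string))
;; ⇒ "1 one 2 two 1"
```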
Variable: *query-replace-map*
This variable holds a special keymap that defines the valid user
responses for |perform-replace| and the commands that use it, as
well as |y-or-n-p| and |map-y-or-n-p|. This map is unusual in two ways:
* The key bindings are not commands, just symbols that are
meaningful to the functions that use this map.
* Prefix keys are not supported; each key binding must be for a
single-event key sequence. This is because the functions don’t
use |read-key-sequence| to get the input; instead, they read a
single event and look it up “by hand”.
Here are the meaningful bindings for |query-replace-map|. Several of
them are meaningful only for |query-replace| and friends.
|act|
Do take the action being considered—in other words, “yes”.
|skip|
Do not take action for this question—in other words, “no”.
|exit|
Answer this question “no”, and give up on the entire series of
questions, assuming that the answers will be “no”.
|exit-prefix|
Like |exit|, but add the key that was pressed to
|unread-command-events| (see Event Input Misc <#Event-Input-Misc>).
|act-and-exit|
Answer this question “yes”, and give up on the entire series of
questions, assuming that subsequent answers will be “no”.
|act-and-show|
Answer this question “yes”, but show the results—don’t advance yet
to the next question.
|automatic|
Answer this question and all subsequent questions in the series with
“yes”, without further user interaction.
|backup|
Move back to the previous place that a question was asked about.
|undo|
Undo last replacement and move back to the place where that
replacement was performed.
|undo-all|
Undo all replacements and move back to the place where the first
replacement was performed.
|edit|
Enter a recursive edit to deal with this question—instead of any
other action that would normally be taken.
|edit-replacement|
Edit the replacement for this question in the minibuffer.
|delete-and-edit|
Delete the text being considered, then enter a recursive edit to
replace it.
|recenter|
|scroll-up|
|scroll-down|
|scroll-other-window|
|scroll-other-window-down|
Perform the specified window scroll operation, then ask the same
question again. Only |y-or-n-p| and related functions use this answer.
|quit|
Perform a quit right away. Only |y-or-n-p| and related functions use
this answer.
|help|
Display some help, then ask again.
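Because the bindings are symbols rather than commands, a program can look a key up and dispatch on the resulting symbol. The keys shown are the usual default bindings; a customized map may differ.

```elisp
;; Look up single-event keys in the map; the values are plain symbols.
(mapcar (lambda (key) (lookup-key query-replace-map key))
        '("y" "n" "!" "q" "?"))
;; ⇒ (act skip automatic exit help)
```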
Variable: *multi-query-replace-map*
This variable holds a keymap that extends |query-replace-map| by
providing additional keybindings that are useful in multi-buffer
replacements. The additional bindings are:
|automatic-all|
Answer this question and all subsequent questions in the series
with “yes”, without further user interaction, for all remaining
buffers.
|exit-current|
Answer this question “no”, and give up on the entire series of
questions for the current buffer. Continue to the next buffer in
the sequence.
Variable: *replace-search-function*
This variable specifies a function that |perform-replace| calls to
search for the next string to replace. Its default value is
|search-forward|. Any other value should name a function of 3
arguments: the first 3 arguments of |search-forward| (see String
Search <#String-Search>).
Variable: *replace-re-search-function*
This variable specifies a function that |perform-replace| calls to
search for the next regexp to replace. Its default value is
|re-search-forward|. Any other value should name a function of 3
arguments: the first 3 arguments of |re-search-forward| (see Regexp
Search <#Regexp-Search>).
34.8 Standard Regular Expressions Used in Editing
This section describes some variables that hold regular expressions used
for certain purposes in editing:
User Option: *page-delimiter*
This is the regular expression describing line-beginnings that
separate pages. The default value is |"^\014"| (i.e., |"^^L"| or
|"^\C-l"|); this matches a line that starts with a formfeed character.
The following two regular expressions should /not/ assume the match
always starts at the beginning of a line; they should not use ‘^’ to
anchor the match. Most often, the paragraph commands do check for a
match only at the beginning of a line, which means that ‘^’ would be
superfluous. When there is a nonzero left margin, they accept matches
that start after the left margin. In that case, a ‘^’ would be
incorrect. However, a ‘^’ is harmless in modes where a left margin is
never used.
User Option: *paragraph-separate*
This is the regular expression for recognizing the beginning of a
line that separates paragraphs. (If you change this, you may have to
change |paragraph-start| also.) The default value is |"[ \t\f]*$"|,
which matches a line that consists entirely of spaces, tabs, and
form feeds (after its left margin).
User Option: *paragraph-start*
This is the regular expression for recognizing the beginning of a
line that starts /or/ separates paragraphs. The default value is
|"\f\\|[ \t]*$"|, which matches a line containing only whitespace or
starting with a form feed (after its left margin).
User Option: *sentence-end*
If non-|nil|, the value should be a regular expression describing
the end of a sentence, including the whitespace following the
sentence. (All paragraph boundaries also end sentences, regardless.)
If the value is |nil|, as it is by default, then the function
|sentence-end| constructs the regexp. That is why you should always
call the function |sentence-end| to obtain the regexp to be used to
recognize the end of a sentence.
Function: *sentence-end*
This function returns the value of the variable |sentence-end|, if
non-|nil|. Otherwise it returns a default value based on the values
of the variables |sentence-end-double-space| (see Definition of
sentence-end-double-space
<#Definition-of-sentence_002dend_002ddouble_002dspace>),
|sentence-end-without-period|, and |sentence-end-without-space|.
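The following sketch tries these regexps against short strings. It binds |sentence-end| and |sentence-end-double-space| explicitly so that the results do not depend on user customization, and it assumes the other related variables have their default values.

```elisp
(let ((sentence-end nil)
      (sentence-end-double-space t))
  (list
   ;; A period followed by two spaces ends a sentence.
   (string-match (sentence-end) "Hello.  World")
   ;; With double spacing required, a single space does not.
   (string-match (sentence-end) "Hello. World")))
;; ⇒ (5 nil)

;; The default `page-delimiter' matches a formfeed at line start.
(let ((page-delimiter "^\014"))
  (string-match page-delimiter "text\n\fmore"))
;; ⇒ 5
```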
35 Syntax Tables
A /syntax table/ specifies the syntactic role of each character in a
buffer. It can be used to determine where words, symbols, and other
syntactic constructs begin and end. This information is used by many
Emacs facilities, including Font Lock mode (see Font Lock Mode
<#Font-Lock-Mode>) and the various complex movement commands (see Motion
<#Motion>).
• Basics <#Syntax-Basics> Basic concepts of syntax tables.
• Syntax Descriptors <#Syntax-Descriptors> How characters are
classified.
• Syntax Table Functions <#Syntax-Table-Functions> How to create,
examine and alter syntax tables.
• Syntax Properties <#Syntax-Properties> Overriding syntax with text
properties.
• Motion and Syntax <#Motion-and-Syntax> Moving over characters with
certain syntaxes.
• Parsing Expressions <#Parsing-Expressions> Parsing balanced
expressions using the syntax table.
• Syntax Table Internals <#Syntax-Table-Internals> How syntax table
information is stored.
• Categories <#Categories> Another way of classifying character syntax.
35.1 Syntax Table Concepts
A syntax table is a data structure which can be used to look up the
/syntax class/ and other syntactic properties of each character. Syntax
tables are used by Lisp programs for scanning and moving across text.
Internally, a syntax table is a char-table (see Char-Tables
<#Char_002dTables>). The element at index c describes the character with
code c; its value is a cons cell which specifies the syntax of the
character in question. See Syntax Table Internals
<#Syntax-Table-Internals>, for details. However, instead of using |aset|
and |aref| to modify and inspect syntax table contents, you should
usually use the higher-level functions |char-syntax| and
|modify-syntax-entry|, which are described in Syntax Table Functions
<#Syntax-Table-Functions>.
Function: *syntax-table-p* /object/
This function returns |t| if object is a syntax table.
Each buffer has its own major mode, and each major mode has its own idea
of the syntax class of various characters. For example, in Lisp mode,
the character ‘;’ begins a comment, but in C mode, it terminates a
statement. To support these variations, the syntax table is local to
each buffer. Typically, each major mode has its own syntax table, which
it installs in all buffers that use that mode. For example, the variable
|emacs-lisp-mode-syntax-table| holds the syntax table used by Emacs Lisp
mode, and |c-mode-syntax-table| holds the syntax table used by C mode.
Changing a major mode’s syntax table alters the syntax in all of that
mode’s buffers, as well as in any buffers subsequently put in that mode.
Occasionally, several similar modes share one syntax table. See Example
Major Modes <#Example-Major-Modes>, for an example of how to set up a
syntax table.
A syntax table can /inherit/ from another syntax table, which is called
its /parent syntax table/. A syntax table can leave the syntax class of
some characters unspecified, by giving them the “inherit” syntax class;
such a character then acquires the syntax class specified by the parent
syntax table (see Syntax Class Table <#Syntax-Class-Table>). Emacs
defines a /standard syntax table/, which is the default parent syntax
table, and is also the syntax table used by Fundamental mode.
Function: *standard-syntax-table*
This function returns the standard syntax table, which is the syntax
table used in Fundamental mode.
Syntax tables are not used by the Emacs Lisp reader, which has its own
built-in syntactic rules which cannot be changed. (Some Lisp systems
provide ways to redefine the read syntax, but we decided to leave this
feature out of Emacs Lisp for simplicity.)
35.2 Syntax Descriptors
The /syntax class/ of a character describes its syntactic role. Each
syntax table specifies the syntax class of each character. There is no
necessary relationship between the class of a character in one syntax
table and its class in any other table.
Each syntax class is designated by a mnemonic character, which serves as
the name of the class when you need to specify a class. Usually, this
designator character is one that is often assigned that class; however,
its meaning as a designator is unvarying and independent of what syntax
that character currently has. Thus, ‘\’ as a designator character always
stands for escape character syntax, regardless of whether the ‘\’
character actually has that syntax in the current syntax table. See
Syntax Class Table <#Syntax-Class-Table>, for a list of syntax classes
and their designator characters.
A /syntax descriptor/ is a Lisp string that describes the syntax class
and other syntactic properties of a character. When you want to modify
the syntax of a character, that is done by calling the function
|modify-syntax-entry| and passing a syntax descriptor as one of its
arguments (see Syntax Table Functions <#Syntax-Table-Functions>).
The first character in a syntax descriptor must be a syntax class
designator character. The second character, if present, specifies a
matching character (e.g., in Lisp, the matching character for ‘(’ is
‘)’); a space specifies that there is no matching character. Then come
characters specifying additional syntax properties (see Syntax Flags
<#Syntax-Flags>).
If no matching character or flags are needed, only one character
(specifying the syntax class) is sufficient.
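As an illustration of how a descriptor is put together, this sketch gives ‘<’ and ‘>’ parenthesis syntax in a scratch syntax table and then inspects the result. The choice of characters is arbitrary, and |matching-paren| and |with-syntax-table| are used here only to examine the table.

```elisp
(let ((table (make-syntax-table)))
  ;; Class `(' with matching character `>'.
  (modify-syntax-entry ?< "(>" table)
  ;; Class `)' with matching character `<'.
  (modify-syntax-entry ?> ")<" table)
  ;; One character suffices when there is no match and no flags.
  (modify-syntax-entry ?_ "w" table)
  (with-syntax-table table
    (list (string (char-syntax ?<))
          (string (matching-paren ?<))
          (string (char-syntax ?_)))))
;; ⇒ ("(" ">" "w")
```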
For example, the syntax descriptor for the character ‘*’ in C mode is
|". 23"| (i.e., punctuation, matching character slot unused, second
character of a comment-starter, first character of a comment-ender), and
the descriptor for ‘/’ is |". 14"| (i.e., punctuation, matching character
slot unused, first character of a comment-starter, second character of a
comment-ender).
Emacs also defines /raw syntax descriptors/, which are used to describe
syntax classes at a lower level. See Syntax Table Internals
<#Syntax-Table-Internals>.
• Syntax Class Table <#Syntax-Class-Table> Table of syntax classes.
• Syntax Flags <#Syntax-Flags> Additional flags each character can have.
35.2.1 Table of Syntax Classes
Here is a table of syntax classes, the characters that designate them,
their meanings, and examples of their use.
Whitespace characters: ‘ ’ or ‘-’
Characters that separate symbols and words from each other.
Typically, whitespace characters have no other syntactic
significance, and multiple whitespace characters are syntactically
equivalent to a single one. Space, tab, and formfeed are classified
as whitespace in almost all major modes.
This syntax class can be designated by either ‘ ’ or ‘-’. Both
designators are equivalent.
Word constituents: ‘w’
Parts of words in human languages. These are typically used in
variable and command names in programs. All upper- and lower-case
letters, and the digits, are typically word constituents.
Symbol constituents: ‘_’
Extra characters used in variable and command names along with word
constituents. Examples include the characters ‘$&*+-_<>’ in Lisp
mode, which may be part of a symbol name even though they are not
part of English words. In standard C, the only non-word-constituent
character that is valid in symbols is underscore (‘_’).
Punctuation characters: ‘.’
Characters used as punctuation in a human language, or used in a
programming language to separate symbols from one another. Some
programming language modes, such as Emacs Lisp mode, have no
characters in this class since the few characters that are not
symbol or word constituents all have other uses. Other programming
language modes, such as C mode, use punctuation syntax for operators.
Open parenthesis characters: ‘(’
Close parenthesis characters: ‘)’
Characters used in dissimilar pairs to surround sentences or
expressions. Such a grouping is begun with an open parenthesis
character and terminated with a close. Each open parenthesis
character matches a particular close parenthesis character, and vice
versa. Normally, Emacs momentarily indicates the matching open
parenthesis when you insert a close parenthesis. See Blinking
<#Blinking>.
In human languages, and in C code, the parenthesis pairs are ‘()’,
‘[]’, and ‘{}’. In Emacs Lisp, the delimiters for lists and vectors
(‘()’ and ‘[]’) are classified as parenthesis characters.
String quotes: ‘"’
Characters used to delimit string constants. The same string quote
character appears at the beginning and the end of a string. Such
quoted strings do not nest.
The parsing facilities of Emacs consider a string as a single token.
The usual syntactic meanings of the characters in the string are
suppressed.
The Lisp modes have two string quote characters: double-quote (‘"’)
and vertical bar (‘|’). ‘|’ is not used in Emacs Lisp, but it is
used in Common Lisp. C also has two string quote characters:
double-quote for strings, and apostrophe (‘'’) for character constants.
Human text has no string quote characters. We do not want quotation
marks to turn off the usual syntactic properties of other characters
in the quotation.
Escape-syntax characters: ‘\’
Characters that start an escape sequence, such as those used in string
and character constants. The character ‘\’ belongs to this class in
both C and Lisp. (In C, it is used thus only inside strings, but it
turns out to cause no trouble to treat it this way throughout C code.)
Characters in this class count as part of words if
|words-include-escapes| is non-|nil|. See Word Motion <#Word-Motion>.
Character quotes: ‘/’
Characters used to quote the following character so that it loses
its normal syntactic meaning. This differs from an escape character
in that only the character immediately following is ever affected.
Characters in this class count as part of words if
|words-include-escapes| is non-|nil|. See Word Motion <#Word-Motion>.
This class is used for backslash in TeX mode.
Paired delimiters: ‘$’
Similar to string quote characters, except that the syntactic
properties of the characters between the delimiters are not
suppressed. Only TeX mode uses a paired delimiter presently—the ‘$’
that both enters and leaves math mode.
Expression prefixes: ‘'’
Characters used for syntactic operators that are considered as part
of an expression if they appear next to one. In Lisp modes, these
characters include the apostrophe, ‘'’ (used for quoting), the
comma, ‘,’ (used in macros), and ‘#’ (used in the read syntax for
certain data types).
Comment starters: ‘<’
Comment enders: ‘>’
Characters used in various languages to delimit comments. Human text
has no comment characters. In Lisp, the semicolon (‘;’) starts a
comment and a newline or formfeed ends one.
Inherit standard syntax: ‘@’
This syntax class does not specify a particular syntax. It says to
look in the standard syntax table to find the syntax of this character.
Generic comment delimiters: ‘!’
(This syntax class is also known as “comment-fence”.) Characters
that start or end a special kind of comment. /Any/ generic comment
delimiter matches /any/ generic comment delimiter, but they cannot
match a comment starter or comment ender; generic comment delimiters
can only match each other.
This syntax class is primarily meant for use with the |syntax-table|
text property (see Syntax Properties <#Syntax-Properties>). You can
mark any range of characters as forming a comment, by giving the
first and last characters of the range |syntax-table| properties
identifying them as generic comment delimiters.
Generic string delimiters: ‘|’
(This syntax class is also known as “string-fence”.) Characters that
start or end a string. This class differs from the string quote
class in that /any/ generic string delimiter can match any other
generic string delimiter; but they do not match ordinary string
quote characters.
This syntax class is primarily meant for use with the |syntax-table|
text property (see Syntax Properties <#Syntax-Properties>). You can
mark any range of characters as forming a string constant, by giving
the first and last characters of the range |syntax-table| properties
identifying them as generic string delimiters.
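A rough sketch of this technique follows. It assumes that |parse-sexp-lookup-properties| is bound to a non-|nil| value so that the parser honors the text property, and it uses |string-to-syntax| and |parse-partial-sexp| to set and examine the syntax; the buffer text and positions are made up for the illustration.

```elisp
(with-temp-buffer
  (insert "a <<text>> b")
  ;; Mark the first `<' (position 3) and the last `>' (position 10)
  ;; as generic string delimiters.
  (put-text-property 3 4 'syntax-table (string-to-syntax "|"))
  (put-text-property 10 11 'syntax-table (string-to-syntax "|"))
  (let ((parse-sexp-lookup-properties t))
    ;; Element 3 of the parse state is non-nil inside a string.
    (list (nth 3 (parse-partial-sexp (point-min) 6))
          (nth 3 (parse-partial-sexp (point-min) 12)))))
;; ⇒ (t nil)
```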
35.2.2 Syntax Flags
In addition to the classes, entries for characters in a syntax table can
specify flags. There are eight possible flags, represented by the
characters ‘1’, ‘2’, ‘3’, ‘4’, ‘b’, ‘c’, ‘n’, and ‘p’.
All the flags except ‘p’ are used to describe comment delimiters. The
digit flags are used for comment delimiters made up of 2 characters.
They indicate that a character can /also/ be part of a comment sequence,
in addition to the syntactic properties associated with its character
class. The flags are independent of the class and each other for the
sake of characters such as ‘*’ in C mode, which is a punctuation
character, /and/ the second character of a start-of-comment sequence
(‘/*’), /and/ the first character of an end-of-comment sequence (‘*/’).
The flags ‘b’, ‘c’, and ‘n’ are used to qualify the corresponding
comment delimiter.
Here is a table of the possible flags for a character c, and what they
mean:
* ‘1’ means c is the start of a two-character comment-start sequence.
* ‘2’ means c is the second character of such a sequence.
* ‘3’ means c is the start of a two-character comment-end sequence.
* ‘4’ means c is the second character of such a sequence.
* ‘b’ means that c as a comment delimiter belongs to the alternative
“b” comment style. For a two-character comment starter, this flag is
only significant on the second char, and for a 2-character comment
ender it is only significant on the first char.
* ‘c’ means that c as a comment delimiter belongs to the alternative
“c” comment style. For a two-character comment delimiter, ‘c’ on
either character makes it of style “c”.
* ‘n’ on a comment delimiter character specifies that this kind of
comment can be nested. Inside such a comment, only comments of the
same style will be recognized. For a two-character comment
delimiter, ‘n’ on either character makes it nestable.
Emacs supports several comment styles simultaneously in any one
syntax table. A comment style is a set of flags ‘b’, ‘c’, and ‘n’,
so there can be up to 8 different comment styles, each one named by
the set of its flags. Each comment delimiter has a style and only
matches comment delimiters of the same style. Thus if a comment
starts with the comment-start sequence of style “bn”, it will extend
until the next matching comment-end sequence of style “bn”. When the
set of flags has neither flag ‘b’ nor flag ‘c’ set, the resulting
style is called the “a” style.
The appropriate comment syntax settings for C++ can be as follows:
‘/’      ‘124’
‘*’      ‘23b’
newline  ‘>’
This defines four comment-delimiting sequences:
‘/*’
This is a comment-start sequence for “b” style because the
second character, ‘*’, has the ‘b’ flag.
‘//’
This is a comment-start sequence for “a” style because the
second character, ‘/’, does not have the ‘b’ flag.
‘*/’
This is a comment-end sequence for “b” style because the first
character, ‘*’, has the ‘b’ flag.
newline
This is a comment-end sequence for “a” style, because the
newline character does not have the ‘b’ flag.
* ‘p’ identifies an additional prefix character for Lisp syntax. These
characters are treated as whitespace when they appear between
expressions. When they appear within an expression, they are handled
according to their usual syntax classes.
The function |backward-prefix-chars| moves back over these
characters, as well as over characters whose primary syntax class is
prefix (‘'’). See Motion and Syntax <#Motion-and-Syntax>.
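The C++ comment settings given above can be tried out in a scratch syntax table. This is only a sketch: it gives ‘/’ and ‘*’ punctuation class, and it uses |forward-comment| and |syntax-ppss| to observe the result; the sample text and positions are assumptions for the illustration.

```elisp
(let ((table (make-syntax-table)))
  (modify-syntax-entry ?/ ". 124" table)
  (modify-syntax-entry ?* ". 23b" table)
  (modify-syntax-entry ?\n ">" table)
  (with-temp-buffer
    (set-syntax-table table)
    (insert "/* block */ x // line\ny")
    (goto-char (point-min))
    ;; Skip the `/* ... */' comment; point ends up just after it.
    (forward-comment 1)
    (list (point)
          ;; Position 5 is inside the "b" style block comment.
          (nth 4 (syntax-ppss 5)) (nth 7 (syntax-ppss 5))
          ;; Position 19 is inside the "a" style line comment.
          (nth 4 (syntax-ppss 19)) (nth 7 (syntax-ppss 19))
          ;; Position 23 is after the newline that ended it.
          (nth 4 (syntax-ppss 23)))))
;; ⇒ (12 t 1 t nil nil)
```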
35.3 Syntax Table Functions
In this section we describe functions for creating, accessing and
altering syntax tables.
Function: *make-syntax-table* /&optional table/
This function creates a new syntax table. If table is non-|nil|, the
parent of the new syntax table is table; otherwise, the parent is
the standard syntax table.
In the new syntax table, all characters are initially given the
“inherit” (‘@’) syntax class, i.e., their syntax is inherited from
the parent table (see Syntax Class Table <#Syntax-Class-Table>).
Function: *copy-syntax-table* /&optional table/
This function constructs a copy of table and returns it. If table is
omitted or |nil|, it returns a copy of the standard syntax table.
Otherwise, an error is signaled if table is not a syntax table.
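The difference between inheriting and copying shows up when the original table changes later. In this sketch the variable names are arbitrary, and ‘_’ is assumed to have symbol syntax in the standard syntax table.

```elisp
(let* ((parent (make-syntax-table))
       (child (make-syntax-table parent))
       (copy (copy-syntax-table parent)))
  ;; Change the parent after the other two tables were created.
  (modify-syntax-entry ?_ "w" parent)
  (list
   ;; The child inherits the new entry from its parent.
   (with-syntax-table child (string (char-syntax ?_)))
   ;; The copy is independent and keeps the earlier syntax.
   (with-syntax-table copy (string (char-syntax ?_)))))
;; ⇒ ("w" "_")
```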
Command: *modify-syntax-entry* /char syntax-descriptor &optional table/
This function sets the syntax entry for char according to
syntax-descriptor. char must be a character, or a cons cell of the
form |(min . max)|; in the latter case, the function sets the syntax
entries for all characters in the range between min and max, inclusive.
The syntax is changed only for table, which defaults to the current
buffer’s syntax table, and not in any other syntax table.
The argument syntax-descriptor is a syntax descriptor, i.e., a
string whose first character is a syntax class designator and whose
second and subsequent characters optionally specify a matching
character and syntax flags. See Syntax Descriptors
<#Syntax-Descriptors>. An error is signaled if syntax-descriptor is
not a valid syntax descriptor.
This function always returns |nil|. The old syntax information in
the table for this character is discarded.
Examples:
;; Put the space character in class whitespace.
(modify-syntax-entry ?\s " ")
⇒ nil
;; Make ‘$’ an open parenthesis character,
;; with ‘^’ as its matching close.
(modify-syntax-entry ?$ "(^")
⇒ nil
;; Make ‘^’ a close parenthesis character,
;; with ‘$’ as its matching open.
(modify-syntax-entry ?^ ")$")
⇒ nil
;; Make ‘/’ a punctuation character,
;; the first character of a start-comment sequence,
;; and the second character of an end-comment sequence.
;; This is used in C mode.
(modify-syntax-entry ?/ ". 14")
⇒ nil
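The |(min . max)| form of char can be sketched as follows. The function name |my-demo-digit-range-syntax| is hypothetical, and the change is made in a private table so that no existing syntax table is affected:

```elisp
(defun my-demo-digit-range-syntax ()
  "Return the syntax classes of `5' and `a' after a range update."
  (let ((table (make-syntax-table)))
    ;; Give every digit from 0 through 9 symbol-constituent syntax,
    ;; but only in this private table.
    (modify-syntax-entry '(?0 . ?9) "_" table)
    (with-syntax-table table
      (list (string (char-syntax ?5))     ; inside the range
            (string (char-syntax ?a))))))  ; untouched, inherited

(my-demo-digit-range-syntax)
;; ⇒ ("_" "w")
```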
Function: *char-syntax* /character/
This function returns the syntax class of character, represented by
its designator character (see Syntax Class Table
<#Syntax-Class-Table>). This returns /only/ the class, not its
matching character or syntax flags.
The following examples apply to C mode. (We use |string| to make it
easier to see the character returned by |char-syntax|.)
;; Space characters have whitespace syntax class.
(string (char-syntax ?\s))
⇒ " "
;; Forward slash characters have punctuation syntax.
;; Note that this |char-syntax| call does not reveal
;; that it is also part of comment-start and -end sequences.
(string (char-syntax ?/))
⇒ "."
;; Open parenthesis characters have open parenthesis syntax.
;; Note that this |char-syntax| call does not reveal that
;; it has a matching character, ‘)’.
(string (char-syntax ?\())
⇒ "("
Function: *set-syntax-table* /table/
This function makes table the syntax table for the current buffer.
It returns table.
Function: *syntax-table*
This function returns the current syntax table, which is the table
for the current buffer.
Command: *describe-syntax* /&optional buffer/
This command displays the contents of the syntax table of buffer (by
default, the current buffer) in a help buffer.
Macro: *with-syntax-table* /table body…/
This macro executes body using table as the current syntax table. It
returns the value of the last form in body, after restoring the old
current syntax table.
Since each buffer has its own current syntax table, we should make
that more precise: |with-syntax-table| temporarily alters the
current syntax table of whichever buffer is current at the time the
macro execution starts. Other buffers are not affected.
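Here is a small sketch of |with-syntax-table| in use. The helper |my-demo-count-words| is hypothetical; it counts words in a string while temporarily treating ‘_’ as a word constituent:

```elisp
(defun my-demo-count-words (string)
  "Count the words in STRING, treating `_' as a word constituent."
  (let ((table (make-syntax-table)))
    (modify-syntax-entry ?_ "w" table)
    (with-temp-buffer
      (insert string)
      (goto-char (point-min))
      ;; The private table is current only while the body runs.
      (with-syntax-table table
        (let ((count 0))
          (while (forward-word 1)
            (setq count (1+ count)))
          count)))))

(my-demo-count-words "foo_bar baz")
;; ⇒ 2
```

Because the table is installed only for the duration of the body, the temporary buffer's own syntax table is back in effect as soon as the macro returns.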
35.4 Syntax Properties
When the syntax table is not flexible enough to specify the syntax of a
language, you can override the syntax table for specific character
occurrences in the buffer, by applying a |syntax-table| text property.
See Text Properties <#Text-Properties>, for how to apply text properties.
The valid values of |syntax-table| text property are:
syntax-table
If the property value is a syntax table, that table is used instead
of the current buffer’s syntax table to determine the syntax for the
underlying text character.
|(syntax-code . matching-char)|
A cons cell of this format is a raw syntax descriptor (see Syntax
Table Internals <#Syntax-Table-Internals>), which directly specifies
a syntax class for the underlying text character.
|nil|
If the property is |nil|, the character’s syntax is determined from
the current syntax table in the usual way.
Variable: *parse-sexp-lookup-properties*
If this is non-|nil|, the syntax scanning functions, like
|forward-sexp|, pay attention to |syntax-table| text properties.
Otherwise they use only the current syntax table.
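The interaction between the |syntax-table| property and |parse-sexp-lookup-properties| can be sketched like this. The helper name |my-demo-property-syntax| is hypothetical; the example relies on |string-to-syntax|, |syntax-after| and |syntax-class| (see Syntax Table Internals <#Syntax-Table-Internals>):

```elisp
(defun my-demo-property-syntax (lookup)
  "Return the syntax class code Emacs sees for `b' in \"abc\".
LOOKUP is the value to bind `parse-sexp-lookup-properties' to."
  (with-temp-buffer
    (insert "abc")
    ;; Override `b' (position 2) with punctuation syntax, using a raw
    ;; syntax descriptor as the property value.
    (put-text-property 2 3 'syntax-table (string-to-syntax "."))
    (let ((parse-sexp-lookup-properties lookup))
      (syntax-class (syntax-after 2)))))

(my-demo-property-syntax t)    ; ⇒ 1 (punctuation, from the property)
(my-demo-property-syntax nil)  ; ⇒ 2 (word, from the syntax table)
```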
Variable: *syntax-propertize-function*
This variable, if non-|nil|, should store a function for applying
|syntax-table| properties to a specified stretch of text. It is
intended to be used by major modes to install a function which
applies |syntax-table| properties in some mode-appropriate way.
The function is called by |syntax-ppss| (see Position Parse
<#Position-Parse>), and by Font Lock mode during syntactic
fontification (see Syntactic Font Lock <#Syntactic-Font-Lock>). It
is called with two arguments, start and end, which are the starting
and ending positions of the text on which it should act. It is
allowed to call |syntax-ppss| on any position before end, but if a
Lisp program calls |syntax-ppss| on some position and later modifies
the buffer at some earlier position, then it is that program’s
responsibility to call |syntax-ppss-flush-cache| to flush the now
obsolete info from the cache.
*Caution:* When this variable is non-|nil|, Emacs removes
|syntax-table| text properties arbitrarily and relies on
|syntax-propertize-function| to reapply them. Thus if this facility
is used at all, the function must apply *all* |syntax-table| text
properties used by the major mode. In particular, modes derived from
a CC Mode mode must not use this variable, since CC Mode uses other
means to apply and remove these text properties.
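As a rough sketch of what such a function might look like, the following hypothetical |my-demo-syntax-propertize| gives every ‘%’ punctuation syntax. The second helper installs it in a scratch buffer and calls |syntax-propertize| (a function that is part of Emacs but not described in this section) to force the properties to be applied; a real major mode would install the function in its setup code instead.

```elisp
(defun my-demo-syntax-propertize (start end)
  "Give every `%' between START and END punctuation syntax.
A hypothetical `syntax-propertize-function', for illustration only."
  (goto-char start)
  (while (search-forward "%" end t)
    (put-text-property (1- (point)) (point)
                       'syntax-table (string-to-syntax "."))))

(defun my-demo-propertize-text (text)
  "Insert TEXT in a scratch buffer; return the property on its first `%'."
  (with-temp-buffer
    (insert text)
    (setq-local syntax-propertize-function #'my-demo-syntax-propertize)
    ;; Make sure properties have been applied up to the end of the text.
    (syntax-propertize (point-max))
    (goto-char (point-min))
    (search-forward "%")
    (get-text-property (1- (point)) 'syntax-table)))

(my-demo-propertize-text "50% off")
;; ⇒ (1)
```

Note that the function applies every |syntax-table| property the (imaginary) mode needs, as the caution above requires.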
Variable: *syntax-propertize-extend-region-functions*
This abnormal hook is run by the syntax parsing code prior to
calling |syntax-propertize-function|. Its role is to help locate
safe starting and ending buffer positions for passing to
|syntax-propertize-function|. For example, a major mode can add a
function to this hook to identify multi-line syntactic constructs,
and ensure that the boundaries do not fall in the middle of one.
Each function in this hook should accept two arguments, start and
end. It should return either a cons cell of two adjusted buffer
positions, |(new-start . new-end)|, or |nil| if no adjustment is
necessary. The hook functions are run in turn, repeatedly, until
they all return |nil|.
35.5 Motion and Syntax
This section describes functions for moving across characters that have
certain syntax classes.
Function: *skip-syntax-forward* /syntaxes &optional limit/
This function moves point forward across characters having syntax
classes mentioned in syntaxes (a string of syntax class characters).
It stops when it encounters the end of the buffer, or position limit
(if specified), or a character it is not supposed to skip.
If syntaxes starts with ‘^’, then the function skips characters
whose syntax is /not/ in syntaxes.
The return value is the distance traveled, which is a nonnegative
integer.
Function: *skip-syntax-backward* /syntaxes &optional limit/
This function moves point backward across characters whose syntax
classes are mentioned in syntaxes. It stops when it encounters the
beginning of the buffer, or position limit (if specified), or a
character it is not supposed to skip.
If syntaxes starts with ‘^’, then the function skips characters
whose syntax is /not/ in syntaxes.
The return value indicates the distance traveled. It is an integer
that is zero or less.
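The two skipping functions can be sketched together as follows. The function name |my-demo-skip-syntax| is hypothetical, and the scratch buffer uses the standard syntax table:

```elisp
(defun my-demo-skip-syntax ()
  "Show return values of the skip-syntax functions on sample text."
  (with-temp-buffer
    (insert "  foo_bar(baz)")
    (goto-char (point-min))
    (list (skip-syntax-forward " ")       ; over the two spaces
          (skip-syntax-forward "w_")      ; over "foo_bar"
          (point)                         ; now just before "("
          (skip-syntax-backward "^ "))))  ; back over non-whitespace

(my-demo-skip-syntax)
;; ⇒ (2 7 10 -7)
```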
Function: *backward-prefix-chars*
This function moves point backward over any number of characters
with expression prefix syntax. This includes both characters in the
expression prefix syntax class, and characters with the ‘p’ flag.
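A minimal sketch, assuming the Emacs Lisp syntax table (where ‘,’ and ‘@’ act as expression prefixes); the function name |my-demo-backward-prefix| is hypothetical:

```elisp
(defun my-demo-backward-prefix ()
  "Move back over the `,@' prefix in front of a symbol; return point."
  (with-temp-buffer
    (set-syntax-table emacs-lisp-mode-syntax-table)
    (insert "(foo ,@bar)")
    (goto-char 8)              ; just before "bar"
    (backward-prefix-chars)
    (point)))

(my-demo-backward-prefix)
;; ⇒ 6
```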
35.6 Parsing Expressions
This section describes functions for parsing and scanning balanced
expressions. We will refer to such expressions as /sexps/, following the
terminology of Lisp, even though these functions can act on languages
other than Lisp. Basically, a sexp is either a balanced parenthetical
grouping, a string, or a symbol (i.e., a sequence of characters whose
syntax is either word constituent or symbol constituent). However,
characters in the expression prefix syntax class (see Syntax Class Table
<#Syntax-Class-Table>) are treated as part of the sexp if they appear
next to it.
The syntax table controls the interpretation of characters, so these
functions can be used for Lisp expressions when in Lisp mode and for C
expressions when in C mode. See List Motion <#List-Motion>, for
convenient higher-level functions for moving over balanced expressions.
A character’s syntax controls how it changes the state of the parser,
rather than describing the state itself. For example, a string delimiter
character toggles the parser state between in-string and in-code, but
the syntax of characters does not directly say whether they are inside a
string. For instance (note that 15 is the syntax code for generic string
delimiters),
(put-text-property 1 9 'syntax-table '(15 . nil))
does not tell Emacs that the first eight chars of the current buffer are
a string, but rather that they are all string delimiters. As a result,
Emacs treats them as four consecutive empty string constants.
• Motion via Parsing <#Motion-via-Parsing> Motion functions that work
by parsing.
• Position Parse <#Position-Parse> Determining the syntactic state of
a position.
• Parser State <#Parser-State> How Emacs represents a syntactic state.
• Low-Level Parsing <#Low_002dLevel-Parsing> Parsing across a
specified region.
• Control Parsing <#Control-Parsing> Parameters that affect parsing.
35.6.1 Motion Commands Based on Parsing
This section describes simple point-motion functions that operate based
on parsing expressions.
Function: *scan-lists* /from count depth/
This function scans forward count balanced parenthetical groupings
from position from. It returns the position where the scan stops. If
count is negative, the scan moves backwards.
If depth is nonzero, treat the starting position as being depth
parentheses deep. The scanner moves forward or backward through the
buffer until the depth changes to zero count times. Hence, a
positive value for depth has the effect of moving out depth levels
of parenthesis from the starting position, while a negative depth
has the effect of moving deeper by -depth levels of parenthesis.
Scanning ignores comments if |parse-sexp-ignore-comments| is non-|nil|.
If the scan reaches the beginning or end of the accessible part of
the buffer before it has scanned over count parenthetical groupings,
the return value is |nil| if the depth at that point is zero; if the
depth is non-zero, a |scan-error| error is signaled.
Function: *scan-sexps* /from count/
This function scans forward count sexps from position from. It
returns the position where the scan stops. If count is negative, the
scan moves backwards.
Scanning ignores comments if |parse-sexp-ignore-comments| is non-|nil|.
If the scan reaches the beginning or end of (the accessible part of)
the buffer while in the middle of a parenthetical grouping, an error
is signaled. If it reaches the beginning or end between groupings
but before count is used up, |nil| is returned.
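The following sketch contrasts |scan-lists| and |scan-sexps| on the same text. The function name |my-demo-scan| is hypothetical, and the standard syntax table is in effect:

```elisp
(defun my-demo-scan ()
  "Return sample results of `scan-lists' and `scan-sexps'."
  (with-temp-buffer
    (insert "(a (b c) d) (e)")
    (list (scan-lists 1 1 0)     ; over the first top-level list
          (scan-lists 5 1 1)     ; from inside "(b c)", out one level
          (scan-sexps 2 2)       ; over "a" and then "(b c)"
          (scan-sexps 16 -1))))  ; backward over "(e)"

(my-demo-scan)
;; ⇒ (12 9 9 13)
```

Neither function moves point; both simply return the position where the scan stopped.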
Function: *forward-comment* /count/
This function moves point forward across count complete comments
(that is, including the starting delimiter and the terminating
delimiter if any), plus any whitespace encountered on the way. It
moves backward if count is negative. If it encounters anything other
than a comment or whitespace, it stops, leaving point at the place
where it stopped. This includes (for instance) finding the end of a
comment when moving forward and expecting the beginning of one. The
function also stops immediately after moving over the specified
number of complete comments. If count comments are found as
expected, with nothing except whitespace between them, it returns
|t|; otherwise it returns |nil|.
This function cannot tell whether the comments it traverses are
embedded within a string. If they look like comments, it treats them
as comments.
To move forward over all comments and whitespace following point,
use |(forward-comment (buffer-size))|. |(buffer-size)| is a good
argument to use, because the number of comments in the buffer cannot
exceed that many.
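Here is a sketch of the |(forward-comment (buffer-size))| idiom, assuming the Emacs Lisp syntax table; the function name |my-demo-skip-comments| is hypothetical:

```elisp
(defun my-demo-skip-comments ()
  "Skip leading comments and whitespace; return (RESULT POINT)."
  (with-temp-buffer
    (set-syntax-table emacs-lisp-mode-syntax-table)
    (insert ";; one\n  ;; two\n(code)")
    (goto-char (point-min))
    (list (forward-comment (buffer-size))
          (point))))

(my-demo-skip-comments)
;; ⇒ (nil 17)
```

The return value is |nil| because fewer than |(buffer-size)| comments were found, but point has still been left just before the first character that is neither comment nor whitespace.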
35.6.2 Finding the Parse State for a Position
For syntactic analysis, such as in indentation, it is often useful to
compute the syntactic state corresponding to a given buffer position.
The function |syntax-ppss| does that conveniently.
Function: *syntax-ppss* /&optional pos/
This function returns the parser state that the parser would reach
at position pos starting from the beginning of the visible portion
of the buffer. See Parser State <#Parser-State>, for a description
of the parser state.
The return value is the same as if you call the low-level parsing
function |parse-partial-sexp| to parse from the beginning of the
visible portion of the buffer to pos (see Low-Level Parsing
<#Low_002dLevel-Parsing>). However, |syntax-ppss| uses caches to
speed up the computation. Due to this optimization, the second value
(previous complete subexpression) and sixth value (minimum
parenthesis depth) in the returned parser state are not meaningful.
This function has a side effect: it adds a buffer-local entry to
|before-change-functions| (see Change Hooks <#Change-Hooks>) for
|syntax-ppss-flush-cache| (see below). This entry keeps the cache
consistent as the buffer is modified. However, the cache might not
be updated if |syntax-ppss| is called while
|before-change-functions| is temporarily let-bound, or if the buffer
is modified without running the hook, such as when using
|inhibit-modification-hooks|. In those cases, it is necessary to
call |syntax-ppss-flush-cache| explicitly.
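A common use of |syntax-ppss| is to ask whether a position lies inside a string or comment. The following is only a sketch; both function names are hypothetical, and the test buffer uses the Emacs Lisp syntax table:

```elisp
(defun my-demo-in-string-or-comment-p (pos)
  "Return non-nil if POS is inside a string or a comment."
  ;; `syntax-ppss' may move point, so protect the caller's position.
  (save-excursion
    (let ((state (syntax-ppss pos)))
      (or (nth 3 state) (nth 4 state)))))

(defun my-demo-classify-positions ()
  "Classify three positions in a sample line of Lisp."
  (with-temp-buffer
    (set-syntax-table emacs-lisp-mode-syntax-table)
    (insert "(foo \"bar\" ; baz")
    (list (and (my-demo-in-string-or-comment-p 3) t)     ; in "foo"
          (and (my-demo-in-string-or-comment-p 8) t)     ; in the string
          (and (my-demo-in-string-or-comment-p 14) t)))) ; in the comment

(my-demo-classify-positions)
;; ⇒ (nil t t)
```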
Function: *syntax-ppss-flush-cache* /beg &rest ignored-args/
This function flushes the cache used by |syntax-ppss|, starting at
position beg. The remaining arguments, ignored-args, are ignored;
this function accepts them so that it can be directly used on hooks
such as |before-change-functions| (see Change Hooks <#Change-Hooks>).
35.6.3 Parser State
A /parser state/ is a list of (currently) eleven elements describing the
state of the syntactic parser, after it parses the text between a
specified starting point and a specified end point in the buffer using
|parse-partial-sexp| (see Low-Level Parsing <#Low_002dLevel-Parsing>).
Parsing functions such as |syntax-ppss| (see Position Parse
<#Position-Parse>) also return a parser state as the value.
|parse-partial-sexp| can accept a parser state as an argument, for
resuming parsing.
Here are the meanings of the elements of the parser state:
0. The depth in parentheses, counting from 0. *Warning:* this can be
negative if there are more close parens than open parens between the
parser’s starting point and end point.
1. The character position of the start of the innermost parenthetical
grouping containing the stopping point; |nil| if none.
2. The character position of the start of the last complete
subexpression terminated; |nil| if none.
3. Non-|nil| if inside a string. More precisely, this is the character
that will terminate the string, or |t| if a generic string delimiter
character should terminate it.
4. |t| if inside a non-nestable comment (of any comment style; see
Syntax Flags <#Syntax-Flags>); or the comment nesting level if
inside a comment that can be nested.
5. |t| if the end point is just after a quote character.
6. The minimum parenthesis depth encountered during this scan.
7. What kind of comment is active: |nil| if not in a comment or in a
comment of style ‘a’; 1 for a comment of style ‘b’; 2 for a comment
of style ‘c’; and |syntax-table| for a comment that should be ended
by a generic comment delimiter character.
8. The string or comment start position. While inside a comment, this
is the position where the comment began; while inside a string, this
is the position where the string began. When outside of strings and
comments, this element is |nil|.
9. The list of the positions of the currently open parentheses,
starting with the outermost.
10. When the last buffer position scanned was the (potential) first
character of a two character construct (comment delimiter or
escaped/char-quoted character pair), the syntax-code (see Syntax
Table Internals <#Syntax-Table-Internals>) of that position.
Otherwise |nil|.
Elements 1, 2, and 6 are ignored in a state which you pass as an
argument to |parse-partial-sexp| to continue parsing. Elements 9 and 10
are mainly used internally by the parser code.
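The elements are ordinarily extracted with |nth|. As a sketch (the function name |my-demo-state-elements| is hypothetical), parsing a fragment that ends inside a string within two open lists gives:

```elisp
(defun my-demo-state-elements ()
  "Return selected parser-state elements for an unfinished fragment."
  (with-temp-buffer
    (insert "(a (b \"c")
    (let ((state (parse-partial-sexp (point-min) (point-max))))
      (list (nth 0 state)     ; depth in parentheses
            (nth 1 state)     ; start of innermost open list
            (nth 3 state)     ; character that will end the string
            (nth 8 state)     ; where the string began
            (nth 9 state))))) ; positions of all open parentheses

(my-demo-state-elements)
;; ⇒ (2 4 34 7 (1 4))
```

Here 34 is the character code of the double quote.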
Some additional useful information is available from a parser state
using these functions:
Function: *syntax-ppss-toplevel-pos* /state/
This function extracts, from parser state state, the last position
scanned in the parse which was at top level in grammatical
structure. “At top level” means outside of any parentheses,
comments, or strings.
The value is |nil| if state represents a parse which has arrived at
a top level position.
Function: *syntax-ppss-context* /state/
Return |string| if the end position of the scan returning state is
in a string, |comment| if it’s in a comment, and |nil| otherwise.
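Both accessors can be sketched on a small sample; the function name |my-demo-context-at| is hypothetical:

```elisp
(defun my-demo-context-at (pos)
  "Return (CONTEXT TOPLEVEL-POS) for POS in a small sample buffer."
  (with-temp-buffer
    (insert "(a \"b\")")
    (let ((state (syntax-ppss pos)))
      (list (syntax-ppss-context state)
            (syntax-ppss-toplevel-pos state)))))

(my-demo-context-at 5)  ; ⇒ (string 1)  inside the string, list opens at 1
(my-demo-context-at 1)  ; ⇒ (nil nil)   already at top level
```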
35.6.4 Low-Level Parsing
The most basic way to use the expression parser is to tell it to start
at a given position with a certain state, and parse up to a specified
end position.
Function: *parse-partial-sexp* /start limit &optional target-depth
stop-before state stop-comment/
This function parses a sexp in the current buffer starting at start,
not scanning past limit. It stops at position limit or when certain
criteria described below are met, and sets point to the location
where parsing stops. It returns a parser state describing the status
of the parse at the point where it stops.
If the third argument target-depth is non-|nil|, parsing stops if
the depth in parentheses becomes equal to target-depth. The depth
starts at 0, or at whatever is given in state.
If the fourth argument stop-before is non-|nil|, parsing stops when
it comes to any character that starts a sexp. If stop-comment is
non-|nil|, parsing stops after the start of an unnested comment. If
stop-comment is the symbol |syntax-table|, parsing stops after the
start of an unnested comment or a string, or after the end of an
unnested comment or a string, whichever comes first.
If state is |nil|, start is assumed to be at the top level of
parenthesis structure, such as the beginning of a function
definition. Alternatively, you might wish to resume parsing in the
middle of the structure. To do this, you must provide a state
argument that describes the initial status of parsing. The value
returned by a previous call to |parse-partial-sexp| will do nicely.
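As a sketch of resuming a parse and of stopping at a target depth (the function name |my-demo-resume-parse| is hypothetical):

```elisp
(defun my-demo-resume-parse ()
  "Parse in two steps, then stop at depth 2; return the results."
  (with-temp-buffer
    (insert "(a (b c) d)")
    (let* (;; Parse the first four characters; we end up two levels deep.
           (first (parse-partial-sexp 1 5))
           ;; Resume from position 5, handing back the earlier state.
           (resumed (parse-partial-sexp 5 (point-max) nil nil first))
           ;; Parse again from the start, stopping once depth reaches 2.
           (stop (progn (parse-partial-sexp 1 (point-max) 2)
                        (point))))
      (list (car first) (car resumed) stop))))

(my-demo-resume-parse)
;; ⇒ (2 0 5)
```

Without the state argument, the second call would assume that position 5 is at top level and would report a negative depth at the end.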
35.6.5 Parameters to Control Parsing
Variable: *multibyte-syntax-as-symbol*
If this variable is non-|nil|, |scan-sexps| treats all non-ASCII
characters as symbol constituents regardless of what the syntax
table says about them. (However, |syntax-table| text properties can
still override the syntax.)
User Option: *parse-sexp-ignore-comments*
If the value is non-|nil|, then comments are treated as whitespace
by the functions in this section and by |forward-sexp|, |scan-lists|
and |scan-sexps|.
The behavior of |parse-partial-sexp| is also affected by
|parse-sexp-lookup-properties| (see Syntax Properties
<#Syntax-Properties>).
Variable: *comment-end-can-be-escaped*
If this buffer local variable is non-|nil|, a single character which
usually terminates a comment doesn’t do so when that character is
escaped. This is used in C and C++ Modes, where line comments
starting with ‘//’ can be continued onto the next line by escaping
the newline with ‘\’.
You can use |forward-comment| to move forward or backward over one
comment or several comments.
35.7 Syntax Table Internals
Syntax tables are implemented as char-tables (see Char-Tables
<#Char_002dTables>), but most Lisp programs don’t work directly with
their elements. Syntax tables do not store syntax data as syntax
descriptors (see Syntax Descriptors <#Syntax-Descriptors>); they use an
internal format, which is documented in this section. This internal
format can also be assigned as syntax properties (see Syntax Properties
<#Syntax-Properties>).
Each entry in a syntax table is a /raw syntax descriptor/: a cons cell
of the form |(syntax-code . matching-char)|. syntax-code is an integer
which encodes the syntax class and syntax flags, according to the table
below. matching-char, if non-|nil|, specifies a matching character
(similar to the second character in a syntax descriptor).
Use |aref| (see Array Functions <#Array-Functions>) to get the raw
syntax descriptor of a character, e.g. |(aref (syntax-table) ch)|.
Here are the syntax codes corresponding to the various syntax classes:
/Code/  /Class/             /Code/  /Class/
0       whitespace          8       paired delimiter
1       punctuation         9       escape
2       word                10      character quote
3       symbol              11      comment-start
4       open parenthesis    12      comment-end
5       close parenthesis   13      inherit
6       expression prefix   14      generic comment
7       string quote        15      generic string
For example, in the standard syntax table, the entry for ‘(’ is
|(4 . 41)|. 41 is the character code for ‘)’.
Syntax flags are encoded in higher order bits, starting 16 bits from the
least significant bit. This table gives the power of two which
corresponds to each syntax flag.
/Prefix/  /Flag/         /Prefix/  /Flag/
‘1’       |(ash 1 16)|   ‘p’       |(ash 1 20)|
‘2’       |(ash 1 17)|   ‘b’       |(ash 1 21)|
‘3’       |(ash 1 18)|   ‘n’       |(ash 1 22)|
‘4’       |(ash 1 19)|   ‘c’       |(ash 1 23)|
Function: *string-to-syntax* /desc/
Given a syntax descriptor desc (a string), this function returns the
corresponding raw syntax descriptor.
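To make the encoding concrete, here is a sketch that converts a descriptor with |string-to-syntax| and then picks the class code and some flag bits apart by hand. Both helper names are hypothetical:

```elisp
(defun my-demo-syntax-flag-p (raw shift)
  "Return t if raw syntax descriptor RAW has the flag bit at SHIFT set.
SHIFT is the shift count from the flag table, e.g. 16 for flag `1'."
  (/= 0 (logand (car raw) (ash 1 shift))))

(defun my-demo-decode (descriptor)
  "Return (CLASS-CODE FLAG-1 FLAG-2 FLAG-4) for syntax DESCRIPTOR."
  (let ((raw (string-to-syntax descriptor)))
    (list (logand (car raw) #xffff)        ; low 16 bits: class code
          (my-demo-syntax-flag-p raw 16)   ; flag `1'
          (my-demo-syntax-flag-p raw 17)   ; flag `2'
          (my-demo-syntax-flag-p raw 19)))) ; flag `4'

(my-demo-decode ". 14")   ; ⇒ (1 t nil t)
(string-to-syntax "()")   ; ⇒ (4 . 41)
```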
Function: *syntax-after* /pos/
This function returns the raw syntax descriptor for the character in
the buffer after position pos, taking account of syntax properties
as well as the syntax table. If pos is outside the buffer’s
accessible portion (see accessible portion <#Narrowing>), the return
value is |nil|.
Function: *syntax-class* /syntax/
This function returns the syntax code for the raw syntax descriptor
syntax. More precisely, it takes the raw syntax descriptor’s
syntax-code component, masks off the high 16 bits which record the
syntax flags, and returns the resulting integer.
If syntax is |nil|, the return value is |nil|. This is so that the
expression
(syntax-class (syntax-after pos))
evaluates to |nil| if |pos| is outside the buffer’s accessible
portion, without throwing errors or returning an incorrect code.
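A short sketch of this behavior, using the standard syntax table (the function name |my-demo-syntax-after| is hypothetical):

```elisp
(defun my-demo-syntax-after ()
  "Show `syntax-after' and `syntax-class' inside and outside the buffer."
  (with-temp-buffer
    (insert "(x")
    (list (syntax-after 1)                    ; raw descriptor of "("
          (syntax-class (syntax-after 1))     ; its class code
          (syntax-class (syntax-after 99))))) ; outside the buffer

(my-demo-syntax-after)
;; ⇒ ((4 . 41) 4 nil)
```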
35.8 Categories
/Categories/ provide an alternate way of classifying characters
syntactically. You can define several categories as needed, then
independently assign each character to one or more categories. Unlike
syntax classes, categories are not mutually exclusive; it is normal for
one character to belong to several categories.
Each buffer has a /category table/ which records which categories are
defined and also which characters belong to each category. Each category
table defines its own categories, but normally these are initialized by
copying from the standard categories table, so that the standard
categories are available in all modes.
Each category has a name, which is an ASCII printing character in the
range ‘ ’ to ‘~’. You specify the name of a category when you define it
with |define-category|.
The category table is actually a char-table (see Char-Tables
<#Char_002dTables>). The element of the category table at index c is a
/category set/—a bool-vector—that indicates which categories character c
belongs to. In this category set, if the element at index cat is |t|,
that means category cat is a member of the set, and that character c
belongs to category cat.
For the next three functions, the optional argument table defaults to
the current buffer’s category table.
Function: *define-category* /char docstring &optional table/
This function defines a new category, with name char and
documentation docstring, for the category table table.
Here’s an example of defining a new category for characters that
have strong right-to-left directionality (see Bidirectional Display
<#Bidirectional-Display>) and using it in a special category table.
To obtain the information about the directionality of characters,
the example code uses the ‘bidi-class’ Unicode property (see
bidi-class <#Character-Properties>).
(defvar special-category-table-for-bidi
  ;; Make an empty category-table.
  (let ((category-table (make-category-table))
        ;; Create a char-table which gives the 'bidi-class' Unicode
        ;; property for each character.
        (uniprop-table
         (unicode-property-table-internal 'bidi-class)))
    (define-category ?R "Characters of bidi-class R, AL, or RLO"
                     category-table)
    ;; Modify the category entry of each character whose
    ;; 'bidi-class' Unicode property is R, AL, or RLO --
    ;; these have a right-to-left directionality.
    (map-char-table
     (lambda (key val)
       (if (memq val '(R AL RLO))
           (modify-category-entry key ?R category-table)))
     uniprop-table)
    category-table))
Function: *category-docstring* /category &optional table/
This function returns the documentation string of category category
in category table table.
(category-docstring ?a)
⇒ "ASCII"
(category-docstring ?l)
⇒ "Latin"
Function: *get-unused-category* /&optional table/
This function returns a category name (a character) which is not
currently defined in table. If all possible categories are in use in
table, it returns |nil|.
Function: *category-table*
This function returns the current buffer’s category table.
Function: *category-table-p* /object/
This function returns |t| if object is a category table, otherwise
|nil|.
Function: *standard-category-table*
This function returns the standard category table.
Function: *copy-category-table* /&optional table/
This function constructs a copy of table and returns it. If table is
not supplied (or is |nil|), it returns a copy of the standard
category table. Otherwise, an error is signaled if table is not a
category table.
Function: *set-category-table* /table/
This function makes table the category table for the current buffer.
It returns table.
Function: *make-category-table*
This creates and returns an empty category table. In an empty
category table, no categories have been allocated, and no characters
belong to any categories.
Function: *make-category-set* /categories/
This function returns a new category set—a bool-vector—whose initial
contents are the categories listed in the string categories. The
elements of categories should be category names; the new category
set has |t| for each of those categories, and |nil| for all other
categories.
(make-category-set "al")
⇒ #&128"\0\0\0\0\0\0\0\0\0\0\0\0\2\20\0\0"
Function: *char-category-set* /char/
This function returns the category set for character char in the
current buffer’s category table. This is the bool-vector which
records which categories the character char belongs to. The function
|char-category-set| does not allocate storage, because it returns
the same bool-vector that exists in the category table.
(char-category-set ?a)
⇒ #&128"\0\0\0\0\0\0\0\0\0\0\0\0\2\20\0\0"
Function: *category-set-mnemonics* /category-set/
This function converts the category set category-set into a string
containing the characters that designate the categories that are
members of the set.
(category-set-mnemonics (char-category-set ?a))
⇒ "al"
Function: *modify-category-entry* /char category &optional table reset/
This function modifies the category set of char in category table
table (which defaults to the current buffer’s category table). char
can be a character, or a cons cell of the form |(min . max)|; in the
latter case, the function modifies the category sets of all
characters in the range between min and max, inclusive.
Normally, it modifies a category set by adding category to it. But
if reset is non-|nil|, then it deletes category instead.
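The category functions above can be combined as in the following sketch, which defines a category in a fresh table and assigns the digits to it. The function name |my-demo-digit-category| and the docstring text are hypothetical:

```elisp
(defun my-demo-digit-category ()
  "Define a category for digits in a fresh table and query it."
  (let* ((table (make-category-table))
         ;; Pick a category name that the table does not use yet.
         (cat (get-unused-category table)))
    (define-category cat "Hypothetical demo category for digits" table)
    ;; Add every character from 0 through 9 to the new category.
    (modify-category-entry '(?0 . ?9) cat table)
    (with-temp-buffer
      (set-category-table table)
      (list (category-docstring cat)
            (aref (char-category-set ?5) cat)     ; a member
            (aref (char-category-set ?a) cat)))))  ; not a member

(my-demo-digit-category)
;; ⇒ ("Hypothetical demo category for digits" t nil)
```

Working in a table made with |make-category-table| keeps the experiment away from the standard category table.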
Command: *describe-categories* /&optional buffer-or-name/
This function describes the category specifications in the current
category table. It inserts the descriptions in a buffer, and then
displays that buffer. If buffer-or-name is non-|nil|, it describes
the category table of that buffer instead.
36 Abbrevs and Abbrev Expansion
An abbreviation or /abbrev/ is a string of characters that may be
expanded to a longer string. The user can insert the abbrev string and
find it replaced automatically with the expansion of the abbrev. This
saves typing.
The set of abbrevs currently in effect is recorded in an /abbrev table/.
Each buffer has a local abbrev table, but normally all buffers in the
same major mode share one abbrev table. There is also a global abbrev
table. Normally both are used.
An abbrev table is represented as an obarray. See Creating Symbols
<#Creating-Symbols>, for information about obarrays. Each abbreviation
is represented by a symbol in the obarray. The symbol’s name is the
abbreviation; its value is the expansion; its function definition is the
hook function for performing the expansion (see Defining Abbrevs
<#Defining-Abbrevs>); and its property list cell contains various
additional properties, including the use count and the number of times
the abbreviation has been expanded (see Abbrev Properties
<#Abbrev-Properties>).
Certain abbrevs, called /system abbrevs/, are defined by a major mode
instead of the user. A system abbrev is identified by its non-|nil|
|:system| property (see Abbrev Properties <#Abbrev-Properties>). When
abbrevs are saved to an abbrev file, system abbrevs are omitted. See
Abbrev Files <#Abbrev-Files>.
Because the symbols used for abbrevs are not interned in the usual
obarray, they will never appear as the result of reading a Lisp
expression; in fact, normally they are never used except by the code
that handles abbrevs. Therefore, it is safe to use them in a nonstandard
way.
If the minor mode Abbrev mode is enabled, the buffer-local variable
|abbrev-mode| is non-|nil|, and abbrevs are automatically expanded in
the buffer. For the user-level commands for abbrevs, see Abbrev Mode
in The GNU Emacs Manual.
• Tables <#Abbrev-Tables> Creating and working with abbrev tables.
• Defining Abbrevs <#Defining-Abbrevs> Specifying abbreviations and
their expansions.
• Files <#Abbrev-Files> Saving abbrevs in files.
• Expansion <#Abbrev-Expansion> Controlling expansion; expansion
subroutines.
• Standard Abbrev Tables <#Standard-Abbrev-Tables> Abbrev tables used
by various major modes.
• Abbrev Properties <#Abbrev-Properties> How to read and set abbrev
properties. Which properties have which effect.
• Abbrev Table Properties <#Abbrev-Table-Properties> How to read and
set abbrev table properties. Which properties have which effect.
36.1 Abbrev Tables
This section describes how to create and manipulate abbrev tables.
Function: *make-abbrev-table* /&optional props/
This function creates and returns a new, empty abbrev table—an
obarray containing no symbols. It is a vector filled with zeros.
props is a property list that is applied to the new table (see
Abbrev Table Properties <#Abbrev-Table-Properties>).
Function: *abbrev-table-p* /object/
This function returns a non-|nil| value if object is an abbrev table.
Function: *clear-abbrev-table* /abbrev-table/
This function undefines all the abbrevs in abbrev-table, leaving it
empty.
Function: *copy-abbrev-table* /abbrev-table/
This function returns a copy of abbrev-table—a new abbrev table
containing the same abbrev definitions. It does /not/ copy any
property lists; only the names, values, and functions.
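The table functions described so far can be tried out on a scratch
table, as in the sketch below. The helper name |my-demo-table-basics| and
the abbrev "foo" are hypothetical. Note that |define-abbrev| and
|clear-abbrev-table| set |abbrevs-changed|, so evaluating this in an
interactive session may later cause Emacs to offer to save abbrevs.

```elisp
;; Hypothetical helper, used only for this illustration.
(defun my-demo-table-basics ()
  "Create, copy and clear a scratch abbrev table; return what was observed."
  (let ((table (make-abbrev-table))
        copy)
    (define-abbrev table "foo" "find outer otter")
    (setq copy (copy-abbrev-table table)) ; independent table, same definitions
    (clear-abbrev-table table)            ; the original is now empty
    (list (abbrev-table-p table)          ; still an abbrev table
          (abbrev-table-p "not a table")
          (abbrev-expansion "foo" table)  ; nil: undefined after clearing
          (abbrev-expansion "foo" copy)))) ; the copy is unaffected

(my-demo-table-basics)
;; => (t nil nil "find outer otter")
```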
Function: *define-abbrev-table* /tabname definitions &optional docstring
&rest props/
This function defines tabname (a symbol) as an abbrev table name,
i.e., as a variable whose value is an abbrev table. It defines
abbrevs in the table according to definitions, a list of elements of
the form |(abbrevname expansion [hook] [props...])|. These elements
are passed as arguments to |define-abbrev|.
The optional string docstring is the documentation string of the
variable tabname. The property list props is applied to the abbrev
table (see Abbrev Table Properties <#Abbrev-Table-Properties>).
If this function is called more than once for the same tabname,
subsequent calls add the definitions in definitions to tabname,
rather than overwriting the entire original contents. (A subsequent
call only overrides abbrevs explicitly redefined or undefined in
definitions.)
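The cumulative behavior of repeated calls can be seen in a small sketch.
The table name |my-demo-abbrev-table| and its entries are hypothetical.

```elisp
;; `my-demo-abbrev-table' is a made-up table name for this example.
(define-abbrev-table 'my-demo-abbrev-table
  '(("teh" "the")
    ("adn" "and"))
  "Abbrev table used only to illustrate `define-abbrev-table'.")

;; A second call adds "recieve" and redefines "teh"; "adn" is left alone.
(define-abbrev-table 'my-demo-abbrev-table
  '(("recieve" "receive")
    ("teh" "THE")))

(let ((table (symbol-value 'my-demo-abbrev-table)))
  (list (abbrev-expansion "adn" table)
        (abbrev-expansion "teh" table)
        (abbrev-expansion "recieve" table)
        ;; The table name was recorded when the table was first created.
        (and (memq 'my-demo-abbrev-table abbrev-table-name-list) t)))
;; => ("and" "THE" "receive" t)
```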
Variable: *abbrev-table-name-list*
This is a list of symbols whose values are abbrev tables.
|define-abbrev-table| adds the new abbrev table name to this list.
Function: *insert-abbrev-table-description* /name &optional human/
This function inserts before point a description of the abbrev table
named name. The argument name is a symbol whose value is an abbrev
table.
If human is non-|nil|, the description is human-oriented. System
abbrevs are listed and identified as such. Otherwise the description
is a Lisp expression—a call to |define-abbrev-table| that would
define name as it is currently defined, but without the system
abbrevs. (The mode or package using name is supposed to add these to
name separately.)
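A minimal sketch of both output styles follows, collecting the
description in a temporary buffer instead of the current one. The table
|my-demo-desc-table| and the helper |my-demo-table-description| are
hypothetical, and the exact layout of the inserted text may differ
between Emacs versions, so no output is shown.

```elisp
;; Made-up table for this example.
(define-abbrev-table 'my-demo-desc-table
  '(("btw" "by the way")))

;; Hypothetical helper, used only for this illustration.
(defun my-demo-table-description (&optional human)
  "Return the description of `my-demo-desc-table' as a string.
A non-nil HUMAN selects the human-oriented format."
  (with-temp-buffer
    (insert-abbrev-table-description 'my-demo-desc-table human)
    (buffer-string)))

(my-demo-table-description)    ; text of a `define-abbrev-table' call
(my-demo-table-description t)  ; human-oriented listing
```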
36.2 Defining Abbrevs
|define-abbrev| is the low-level function for defining an abbrev in an
abbrev table.
When a major mode defines a system abbrev, it should call
|define-abbrev| and specify |t| for the |:system| property. Be aware
that any saved non-system abbrevs are restored at startup, i.e., before
some major modes are loaded. Therefore, major modes should not assume
that their abbrev tables are empty when they are first loaded.
Function: *define-abbrev* /abbrev-table name expansion &optional hook
&rest props/
This function defines an abbrev named name, in abbrev-table, to
expand to expansion and call hook, with properties props (see Abbrev
Properties <#Abbrev-Properties>). The return value is name. The
|:system| property in props is treated specially here: if it has the
value |force|, then it will overwrite an existing definition even
for a non-system abbrev of the same name.
name should be a string. The argument expansion is normally the
desired expansion (a string), or |nil| to undefine the abbrev. If it
is anything but a string or |nil|, then the abbreviation expands
solely by running hook.
The argument hook is a function or |nil|. If hook is non-|nil|, then
it is called with no arguments after the abbrev is replaced with
expansion; point is located at the end of expansion when hook is
called.
If hook is a non-|nil| symbol whose |no-self-insert| property is
non-|nil|, hook can explicitly control whether to insert the
self-inserting input character that triggered the expansion. If hook
returns non-|nil| in this case, that inhibits insertion of the
character. By contrast, if hook returns |nil|, |expand-abbrev| (or
|abbrev-insert|) also returns |nil|, as if expansion had not really
occurred.
Normally, |define-abbrev| sets the variable |abbrevs-changed| to
|t|, if it actually changes the abbrev. This is so that some
commands will offer to save the abbrevs. It does not do this for a
system abbrev, since those aren’t saved anyway.
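The following sketch shows one way to call |define-abbrev| with a hook
and with the |:system| property. All names beginning with |my-demo-| and
the abbrevs themselves are hypothetical; the hook is only installed
here, not run.

```elisp
;; Hypothetical hook, called after the expansion has been inserted.
(defun my-demo-sig-hook ()
  "Add a line after the expansion.
Return t so that the character that triggered the expansion is not inserted."
  (insert "\n-- sent from Emacs")
  t)
;; Without this property the hook's return value is not consulted.
(put 'my-demo-sig-hook 'no-self-insert t)

;; Hypothetical helper, used only for this illustration.
(defun my-demo-define-abbrevs ()
  "Define two abbrevs in a scratch table and return some facts about them."
  (let ((table (make-abbrev-table)))
    (list (define-abbrev table "sig" "Best regards" #'my-demo-sig-hook)
          (define-abbrev table "todo" "TODO:" nil :system t)
          (abbrev-get (abbrev-symbol "todo" table) :system)
          (symbol-function (abbrev-symbol "sig" table)))))

(my-demo-define-abbrevs)
;; => ("sig" "todo" t my-demo-sig-hook)
```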
User Option: *only-global-abbrevs*
If this variable is non-|nil|, it means that the user plans to use
global abbrevs only. This tells the commands that define
mode-specific abbrevs to define global ones instead. This variable
does not alter the behavior of the functions in this section; it is
examined by their callers.
36.3 Saving Abbrevs in Files
A file of saved abbrev definitions is actually a file of Lisp code. The
abbrevs are saved in the form of a Lisp program to define the same
abbrev tables with the same contents. Therefore, you can load the file
with |load| (see How Programs Do Loading <#How-Programs-Do-Loading>).
However, the function |quietly-read-abbrev-file| is provided as a more
convenient interface. Emacs automatically calls this function at startup.
User-level facilities such as |save-some-buffers| can save abbrevs in a
file automatically, under the control of variables described here.
User Option: *abbrev-file-name*
This is the default file name for reading and saving abbrevs. By
default, Emacs will look for ~/.emacs.d/abbrev_defs, and, if not
found, for ~/.abbrev_defs; if neither file exists, Emacs will create
~/.emacs.d/abbrev_defs.
Function: *quietly-read-abbrev-file* /&optional filename/
This function reads abbrev definitions from a file named filename,
previously written with |write-abbrev-file|. If filename is omitted
or |nil|, the file specified in |abbrev-file-name| is used.
As the name implies, this function does not display any messages.
User Option: *save-abbrevs*
A non-|nil| value for |save-abbrevs| means that Emacs should offer
to save abbrevs (if any have changed) when files are saved. If the
value is |silently|, Emacs saves the abbrevs without asking the
user. |abbrev-file-name| specifies the file to save the abbrevs in.
The default value is |t|.
Variable: *abbrevs-changed*
This variable is set non-|nil| by defining or altering any abbrevs
(except system abbrevs). This serves as a flag for various Emacs
commands to offer to save your abbrevs.
Command: *write-abbrev-file* /&optional filename/
Save all abbrev definitions (except system abbrevs), for all abbrev
tables listed in |abbrev-table-name-list|, in the file filename, in
the form of a Lisp program that when loaded will define the same
abbrevs. Tables that do not have any abbrevs to save are omitted. If
filename is |nil| or omitted, |abbrev-file-name| is used. This
function returns |nil|.
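A round trip through a file can be sketched as follows. The table
|my-demo-saved-table|, the helper |my-demo-save-and-restore| and the use
of a temporary file are assumptions made for the example. Be aware that
the file receives the abbrevs of every table in
|abbrev-table-name-list|, and that reading it back resets
|abbrevs-changed| to |nil|.

```elisp
;; Made-up table for this example.
(define-abbrev-table 'my-demo-saved-table
  '(("omw" "on my way")))

;; Hypothetical helper, used only for this illustration.
(defun my-demo-save-and-restore ()
  "Write the abbrevs to a temporary file, clear one table, read them back."
  (let ((file (make-temp-file "abbrev-demo-" nil ".el"))
        (table (symbol-value 'my-demo-saved-table)))
    (unwind-protect
        (progn
          (write-abbrev-file file)          ; saves all non-system abbrevs
          (clear-abbrev-table table)        ; "omw" is now undefined
          (quietly-read-abbrev-file file)   ; loading the file defines it again
          (abbrev-expansion "omw" table))
      (delete-file file))))

(my-demo-save-and-restore)
;; => "on my way"
```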
36.4 Looking Up and Expanding Abbreviations
Abbrevs are usually expanded by certain interactive commands, including
|self-insert-command|. This section describes the subroutines used in
writing such commands, as well as the variables they use for communication.
Function: *abbrev-symbol* /abbrev &optional table/
This function returns the symbol representing the abbrev named
abbrev. It returns |nil| if that abbrev is not defined. The optional
second argument table is the abbrev table in which to look it up. If
table is |nil|, this function tries first the current buffer’s local
abbrev table, and second the global abbrev table.
Function: *abbrev-expansion* /abbrev &optional table/
This function returns the string that abbrev would expand into (as
defined by the abbrev tables used for the current buffer). It
returns |nil| if abbrev is not a valid abbrev. The optional argument
table specifies the abbrev table to use, as in |abbrev-symbol|.
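As a small sketch of these two lookup functions, the code below queries
a scratch table both explicitly and through the buffer's local abbrev
table. The helper |my-demo-lookup| and the abbrev "imo" are hypothetical,
and the second result assumes that no minor-mode abbrev table defines
"imo" differently.

```elisp
;; Hypothetical helper, used only for this illustration.
(defun my-demo-lookup ()
  "Look up a sample abbrev explicitly and through the local abbrev table."
  (let ((table (make-abbrev-table)))
    (define-abbrev table "imo" "in my opinion")
    (with-temp-buffer
      (setq-local local-abbrev-table table)
      (list (abbrev-expansion "imo" table)            ; explicit table
            (abbrev-expansion "imo")                  ; via the local table
            (abbrev-symbol "no-such-abbrev" table))))) ; undefined: nil

(my-demo-lookup)
;; => ("in my opinion" "in my opinion" nil)
```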
Command: *expand-abbrev*
This command expands the abbrev before point, if any. If point does
not follow an abbrev, this command does nothing. To do the
expansion, it calls the function that is the value of the
|abbrev-expand-function| variable, with no arguments, and returns
whatever that function does.
The default expansion function returns the abbrev symbol if it did
expansion, and |nil| otherwise. If the abbrev symbol has a hook
function that is a symbol whose |no-self-insert| property is
non-|nil|, and if the hook function returns |nil| as its value, then
the default expansion function returns |nil|, even though expansion
did occur.
Function: *abbrev-insert* /abbrev &optional name start end/
This function inserts the abbrev expansion of |abbrev|, replacing
the text between |start| and |end|. If |start| is omitted, it
defaults to point. |name|, if non-|nil|, should be the name by which
this abbrev was found (a string); it is used to figure out whether
to adjust the capitalization of the expansion. The function returns
|abbrev| if the abbrev was successfully inserted, otherwise it
returns |nil|.
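Both entry points can be exercised non-interactively in a temporary
buffer, as in this sketch. The helper |my-demo-expand| and the abbrev
"afaik" are hypothetical, and the default value of
|abbrev-expand-function| is assumed.

```elisp
;; Hypothetical helper, used only for this illustration.
(defun my-demo-expand ()
  "Expand a sample abbrev with `expand-abbrev' and with `abbrev-insert'."
  (let ((table (make-abbrev-table)))
    (define-abbrev table "afaik" "as far as I know")
    (list
     (with-temp-buffer
       (setq-local local-abbrev-table table)
       (insert "afaik")
       (expand-abbrev)                  ; expands the word before point
       (buffer-string))
     (with-temp-buffer
       (insert "Well, ")
       ;; START and END omitted: the expansion is inserted at point.
       (abbrev-insert (abbrev-symbol "afaik" table))
       (buffer-string)))))

(my-demo-expand)
;; => ("as far as I know" "Well, as far as I know")
```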
Command: *abbrev-prefix-mark* /&optional arg/
This command marks the current location of point as the beginning of
an abbrev. The next call to |expand-abbrev| will use the text from
here to point (where it is then) as the abbrev to expand, rather
than using the previous word as usual.
First, this command expands any abbrev before point, unless arg is
non-|nil|. (Interactively, arg is the prefix argument.) Then it
inserts a hyphen before point, to indicate the start of the next
abbrev to be expanded. The actual expansion removes the hyphen.
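The effect of the prefix mark can be sketched in a temporary buffer. The
helper |my-demo-prefix-mark| is hypothetical; the abbrev "cnst" and the
prefix "re" are chosen only for illustration.

```elisp
;; Hypothetical helper, used only for this illustration.
(defun my-demo-prefix-mark ()
  "Expand an abbrev that directly follows a prefix, without a word break."
  (let ((table (make-abbrev-table)))
    (define-abbrev table "cnst" "construction")
    (with-temp-buffer
      (setq-local local-abbrev-table table)
      (insert "re")
      (abbrev-prefix-mark t)  ; non-nil ARG: do not try to expand "re" first
      (insert "cnst")         ; the buffer now reads "re-cnst"
      (expand-abbrev)         ; removes the hyphen and expands "cnst"
      (buffer-string))))

(my-demo-prefix-mark)
;; => "reconstruction"
```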
User Option: *abbrev-all-caps*
When this is set non-|nil|, an abbrev entered entirely in upper case
is expanded using all upper case. Otherwise, an abbrev entered
entirely in upper case is expanded by capitalizing each word of the
expansion.
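The difference can be observed with a two-word expansion, as in the
following sketch. The helper |my-demo-all-caps| and the abbrev "fb" are
hypothetical.

```elisp
;; Hypothetical helper, used only for this illustration.
(defun my-demo-all-caps (flag)
  "Expand the all-caps abbrev \"FB\" with `abbrev-all-caps' bound to FLAG."
  (let ((table (make-abbrev-table))
        (abbrev-all-caps flag))
    (define-abbrev table "fb" "foo bar")
    (with-temp-buffer
      (setq-local local-abbrev-table table)
      (insert "FB")           ; abbrev typed entirely in upper case
      (expand-abbrev)
      (buffer-string))))

(my-demo-all-caps nil)  ; => "Foo Bar"
(my-demo-all-caps t)    ; => "FOO BAR"
```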
Variable: *abbrev-start-location*
The value of this variable is a buffer position (an integer or a
marker) for |expand-abbrev| to use as the start of the next abbrev
to be expanded. The value can also be |nil|, which means to use the
word before point instead. |abbrev-start-location| is set to |nil|
each time |expand-abbrev| is called. This variable is also set by
|abbrev-prefix-mark|.
Variable: *abbrev-start-location-buffer*
The value of this variable is the buffer for which
|abbrev-start-location| has been set. Trying to expand an abbrev in
any other buffer clears |abbrev-start-location|. This variable is
set by |abbrev-prefix-mark|.
Variable: *last-abbrev*
This is the |abbrev-symbol| of the most recent abbrev expanded. This
information is left by |expand-abbrev| for the sake of the
|unexpand-abbrev| command (see Expanding Abbrevs